dx syntax is not clear

Description

it mentions

“

A Unicode Dictionary Break Exclusion Identifier specifies scripts to be excluded from dictionary-based text break (for words and lines). The valid values are of one or more items of type SCRIPT_CODE as specified in the name attribute value in the type element of key name="dx" in bcp47/segmentation.xml.
This affects break iteration regardless of locale.

”

But it is not clear how to specify “one or more items” in the syntax of the type value for a dx keyword.

For example, suppose we have a zh locale and want to exclude dictionary-based breaking for the Thai (Thai), Khmer (Khmr), and Lao (Laoo) scripts.

Should that locale ID be:

A. zh-u-dx-khmrlaoo-thai ?

B. zh-u-dx-thai-khmr-laoo?

C. zh-u-dx-khmr-dx-laoo-dx-thai ?

D. zh-u-dx-thai-dx-khmr-dx-laoo ?

E. zh-u-dx-khmr-laoo-thai?

Or are all of them acceptable?

Linked work items

is cloned by

CLDR-17194

Resolve remaining questions about dx syntax

relates to

ICU-13219

add word and line BreakIterator option -u-dx- for dictionary exclusion

CLDR-17152

BRS Update spec modification section

Activity

Mark Davis
January 20, 2024 at 12:29 AM

The last comment was addressed in email, so reclosing.

Frank Yung-Fong Tang
November 20, 2023 at 8:13 PM

I found another issue while trying to implement

in ICU

the usage of “dx-zyyy”.

The current text says:

“The code Zyyy (Common) can be specified to exclude all scripts, in which case it should be the only SCRIPT_CODE value specified. If others are included mistakenly, they are ignored.”

But this is very troublesome. Why? Because Wikipedia says that script Zyyy denotes a set of 8,306 Unicode characters, which is not the set of characters of all scripts.

Now you may argue that Wikipedia is wrong. But is it?

For example, if we look at

it says U+30FC and U+30FB match sc Zyyy = Common. It is clear that Zyyy is used in that context to match a clearly defined set of Unicode characters, NOT “all scripts”. Using Zyyy to denote all scripts in UTS35 for u-dx therefore creates a second, different interpretation of Zyyy.

said

Zyyy    998    Code for undetermined script

Notice that most characters in all scripts are NOT “undetermined”. Characters in the Thai script are determined to be in the Thai script, and therefore do not match the definition of “undetermined”. Using zyyy to denote the concept of “specified to exclude all scripts” is therefore very troublesome.
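
For reference, the rule quoted above (Zyyy excludes everything, and any other codes listed with it are ignored) could be sketched as follows. This is only an illustration of the quoted sentence, not ICU or CLDR code; the function name resolve_dx_exclusions and the use of a one-element set to stand for "all scripts" are assumptions made for this example.

```python
def resolve_dx_exclusions(script_codes):
    """Apply the quoted Zyyy rule to a list of dx script codes.

    Hypothetical helper for illustration only.
    """
    # Script subtags in a -u- extension are case-insensitive; compare in lowercase.
    codes = [code.lower() for code in script_codes]
    if "zyyy" in codes:
        # Per the quoted text, Zyyy means "exclude all scripts" and any
        # other codes listed alongside it are ignored.
        return {"zyyy"}
    return set(codes)


if __name__ == "__main__":
    print(resolve_dx_exclusions(["thai", "zyyy", "khmr"]))
    print(resolve_dx_exclusions(["Thai", "khmr"]))
```

Note that the sketch treats zyyy purely as a sentinel, which is exactly the second interpretation of the code that this comment objects to.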

Mark Davis
October 25, 2023 at 6:57 AM

What the spec should specify is that:

Since it is a set, any order is equivalent.
However, the canonical BCP 47 order is alphabetical.
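
A minimal sketch of what that alphabetical ordering could look like when applied to a locale ID. The function name canonicalize_dx is hypothetical, and the sketch assumes a lowercase, well-formed ID with at most one dx keyword; dropping duplicate codes is also an assumption that follows from treating the value as a set. It does not attempt the rest of BCP 47 canonicalization.

```python
def canonicalize_dx(locale_id: str) -> str:
    """Sort the script codes after -u-dx- alphabetically (illustrative only)."""
    subtags = locale_id.split("-")
    if "u" not in subtags:
        return locale_id
    u_index = subtags.index("u")
    if "dx" not in subtags[u_index + 1:]:
        return locale_id
    start = subtags.index("dx", u_index + 1) + 1
    # The dx value runs until the next key (2 characters), the next
    # singleton (1 character), or the end of the ID.
    end = start
    while end < len(subtags) and len(subtags[end]) > 2:
        end += 1
    # The value is a set, so duplicates are dropped and order is normalized.
    codes = sorted(set(subtags[start:end]))
    return "-".join(subtags[:start] + codes + subtags[end:])


if __name__ == "__main__":
    print(canonicalize_dx("zh-u-dx-thai-khmr-laoo"))
```

Under these assumptions, zh-u-dx-thai-khmr-laoo and zh-u-dx-laoo-thai-khmr both map to the same string, which is what "any order is equivalent" implies.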

Steven R. Loomis
October 23, 2023 at 5:09 PM

B. zh-u-dx-thai-khmr-laoo

As to the sorting, does not specify it that I could see.

I am going to say LGTM on this overall as to the original scope, and let you respond on the ordering. You can move this to fixed or clone it etc.

Frank Yung-Fong Tang
October 5, 2023 at 6:50 PM

Would there be any requirement in the canonicalization process to sort the script codes?

So would zh-u-dx-thai-khmr-laoo be turned into zh-u-dx-khmr-laoo-thai,
or stay as zh-u-dx-thai-khmr-laoo?


Fixed

Details

Priority

major

Assignee

Mark Davis

Reporter

Frank Yung-Fong Tang

Reviewer

Steven R. Loomis

Fix versions

Components

Labels

Created July 28, 2023 at 3:24 AM

Updated January 20, 2024 at 12:29 AM

Resolved January 20, 2024 at 12:29 AM