dx syntax is not clear
Description
Activity
Mark Davis January 20, 2024 at 12:29 AM
The last comment was addressed in email, so reclosing.

Frank Yung-Fong Tang November 20, 2023 at 8:13 PM
I found another issue while trying to implement
in ICU
the usage of “dx-zyyy”.
The text currently says:
“The code Zyyy (Common) can be specified to exclude all scripts, in which case it should be the only SCRIPT_CODE value specified. If others are included mistakenly, they are ignored.”
But this is very troublesome. Why? Because Wikipedia says that script Zyyy denotes a set of 8,306 Unicode characters, which is not the characters of all scripts.
Now, you may argue that Wikipedia is wrong. But is it?
For example, if we look at
it says that U+30FC and U+30FB match sc Zyyy = Common. It is clear that, in that context, Zyyy is used to match a clearly defined set of Unicode characters, NOT “all scripts”. Using Zyyy to denote all scripts in UTS35 for u-dx therefore creates a second and different interpretation of Zyyy.
said
Zyyy | 998 | Code for undetermined script |
Notice that most characters in all scripts are NOT “undetermined”. Characters in the Thai script are determined to be in the Thai script, and therefore do not match the definition of “undetermined”. Therefore, using Zyyy to denote the concept of “specified to exclude all scripts” is very troublesome.
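For reference, the rule quoted above could be read roughly as follows. This is only a sketch of one reading of the quoted text, not ICU code; the function name `effective_dx_scripts` is hypothetical and the input is assumed to be the list of SCRIPT_CODE values already extracted from the dx keyword.

```python
def effective_dx_scripts(script_codes):
    """Hypothetical helper: apply the quoted Zyyy rule to a list of dx values."""
    codes = [code.lower() for code in script_codes]
    # Per the quoted text, Zyyy should be the only value specified;
    # any other codes given alongside it are ignored.
    if "zyyy" in codes:
        return ["zyyy"]
    return codes


print(effective_dx_scripts(["thai", "zyyy", "laoo"]))
print(effective_dx_scripts(["thai", "khmr"]))
```

Under this reading, “zyyy” is a special marker in the dx value list rather than a match against characters whose script property is Common, which is the second interpretation objected to above.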
Mark Davis October 25, 2023 at 6:57 AM
What the spec should specify is that:
Since it is a set, any order is equivalent.
However, the canonical BCP 47 order is alphabetical.
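The two points above can be sketched as a small canonicalization step. This is a minimal illustration, not the ICU or CLDR implementation: the helper name `canonicalize_dx` is hypothetical, and it assumes that the dx values are the run of subtags of three or more characters that directly follows the `dx` key inside the `-u-` extension.

```python
def canonicalize_dx(locale_id):
    """Hypothetical helper: sort the script codes that follow the dx key."""
    subtags = locale_id.split("-")
    lowered = [s.lower() for s in subtags]
    try:
        u_index = lowered.index("u")
        start = lowered.index("dx", u_index + 1) + 1
    except ValueError:
        # No -u- extension or no dx key: nothing to reorder.
        return locale_id
    end = start
    # Assumption: values are 3+ characters; a shorter subtag is the next
    # key or singleton and ends the dx value list.
    while end < len(subtags) and len(subtags[end]) >= 3:
        end += 1
    subtags[start:end] = sorted(subtags[start:end], key=str.lower)
    return "-".join(subtags)


print(canonicalize_dx("zh-u-dx-thai-khmr-laoo"))  # zh-u-dx-khmr-laoo-thai
```

Because the values form a set, both inputs denote the same exclusions; only the canonical spelling differs.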

Steven R. Loomis October 23, 2023 at 5:09 PM
B. zh-u-dx-thai-khmr-laoo
As to the sorting, does not specify it that I could see.
I am going to say LGTM on this overall as to the original scope, and let you respond on the ordering. You can move this to fixed or clone it etc.

Frank Yung-Fong Tang October 5, 2023 at 6:50 PM
Would there be any requirement in the canonicalization process to sort the script codes?
So would zh-u-dx-thai-khmr-laoo be turned into zh-u-dx-khmr-laoo-thai?
Or would it stay as zh-u-dx-thai-khmr-laoo?
In
it mentions
“
A Unicode Dictionary Break Exclusion Identifier specifies scripts to be excluded from dictionary-based text break (for words and lines). The valid values are of one or more items of type SCRIPT_CODE as specified in the name attribute value in the type element of key name="dx" in bcp47/segmentation.xml.
This affects break iteration regardless of locale.
”
But it is not clear how to specify “one or more items” in the syntax of the type of a dx keyword.
For example, suppose we have a locale of zh and want to exclude dictionary break for the Thai (Thai), Khmer (Khmr) and Lao (Laoo) scripts.
Should that locale id be
A. zh-u-dx-khmrlaoo-thai ?
or
B. zh-u-dx-thai-khmr-laoo?
or
C. zh-u-dx-khmr-dx-laoo-dx-thai ?
or
D. zh-u-dx-thai-dx-khmr-dx-laoo ?
or
E. zh-u-dx-khmr-laoo-thai?
or all are acceptable?
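To make the difference between these candidates concrete, here is a rough sketch of how a generic `-u-` extension splitter might see them. It is illustrative only and not taken from ICU: the function name `unicode_keywords` is hypothetical, it treats two-character subtags as keys and longer subtags as values of the preceding key, and it assumes that only the first occurrence of a repeated key counts.

```python
def unicode_keywords(locale_id):
    """Illustrative only: split the -u- extension into {key: [value subtags]}."""
    subtags = locale_id.lower().split("-")
    if "u" not in subtags:
        return {}
    keywords = {}
    key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:
            break  # another singleton ends the -u- extension
        if len(subtag) == 2:
            # Assumption: a repeated key is ignored along with its values.
            duplicate = subtag in keywords
            key = None if duplicate else subtag
            if not duplicate:
                keywords[subtag] = []
        elif key is not None:
            keywords[key].append(subtag)
    return keywords


for candidate in ("zh-u-dx-khmrlaoo-thai",          # A
                  "zh-u-dx-thai-khmr-laoo",         # B
                  "zh-u-dx-khmr-dx-laoo-dx-thai",   # C
                  "zh-u-dx-khmr-laoo-thai"):        # E
    print(candidate, unicode_keywords(candidate))
```

With these assumptions, A yields a single eight-letter value “khmrlaoo” that is not a script code, C keeps only “khmr”, and B and E yield the same three codes in different orders.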