dx syntax is not clear

Description

In

it mentions

A Unicode Dictionary Break Exclusion Identifier specifies scripts to be excluded from dictionary-based text break (for words and lines). The valid values are of one or more items of type SCRIPT_CODE as specified in the name attribute value in the type element of key name="dx" in bcp47/segmentation.xml.
This affects break iteration regardless of locale.

But it is not clear how to specify “one or more items” in the syntax of the type of a dx keyword.

For example, suppose we want a zh locale that excludes dictionary-based breaking for the Thai (Thai), Khmer (Khmr), and Lao (Laoo) scripts.

Should that locale id be

A. zh-u-dx-khmrlaoo-thai ?

or

B. zh-u-dx-thai-khmr-laoo?

or

C. zh-u-dx-khmr-dx-laoo-dx-thai ?

or

D. zh-u-dx-thai-dx-khmr-dx-laoo ?

or

E. zh-u-dx-khmr-laoo-thai?

Or are all of them acceptable?

Activity

Mark Davis 
January 20, 2024 at 12:29 AM

The last comment was addressed in email, so reclosing.

Frank Yung-Fong Tang 
November 20, 2023 at 8:13 PM

I found another issue when I tried to implement

in ICU

the usage of “dx-zyyy”

It currently says:

“The code Zyyy (Common) can be specified to exclude all scripts, in which case it should be the only SCRIPT_CODE value specified. If others are included mistakenly, they are ignored.“
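
Read literally, the quoted rule could be sketched as follows. This is only an illustration of the sentence above, not ICU's implementation; the function name `effective_dx_exclusions` is hypothetical.

```python
def effective_dx_exclusions(script_codes):
    """Apply the quoted rule: if zyyy is present, it is the only value that counts.

    Hypothetical helper for illustration; comparison is case-insensitive.
    """
    codes = [code.lower() for code in script_codes]
    if "zyyy" in codes:
        # Per the quoted text, any other codes given alongside zyyy are ignored.
        return ["zyyy"]
    return codes


print(effective_dx_exclusions(["thai", "zyyy", "khmr"]))
print(effective_dx_exclusions(["thai", "khmr"]))
```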

But this is very troublesome. Why? Because Wikipedia says that script Zyyy denotes a set of 8,306 Unicode characters, which are not the characters of all scripts.

Now you may argue that Wikipedia is wrong. But is it?

For example, if we look at

it says that U+30FC and U+30FB match sc=Zyyy (Common). It is clear that Zyyy is used in that context to match a clearly defined set of Unicode characters, NOT “all scripts”. Using Zyyy to denote all scripts in UTS35 for u-dx therefore creates a second and different interpretation of Zyyy.

said

Zyyy    998    Code for undetermined script

Notice that most characters in all scripts are NOT “undetermined”. Characters in the Thai script are determined to be in the Thai script, and therefore do not match the definition of “undetermined”. Therefore, using Zyyy to denote the concept of “specified to exclude all scripts” is very troublesome.

Mark Davis 
October 25, 2023 at 6:57 AM

What the spec should specify is that

  • Since it is a set, any order is equivalent.

  • However, the canonical bcp47 order is alphabetical
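
A minimal sketch of what those two points imply for a locale ID, assuming the dx values are the consecutive 3 to 8 character subtags that follow the `dx` key inside the `-u-` extension. The helper name `canonicalize_dx` is hypothetical, and full BCP 47 canonicalization (key ordering, case rules, aliases) is deliberately not attempted here.

```python
def canonicalize_dx(locale_id: str) -> str:
    """Sort the script codes that follow the dx key alphabetically.

    Hypothetical helper: it only reorders the dx values and leaves every
    other subtag untouched.
    """
    subtags = locale_id.split("-")
    lowered = [s.lower() for s in subtags]
    try:
        u_index = lowered.index("u")
        start = lowered.index("dx", u_index + 1) + 1
    except ValueError:
        return locale_id  # no -u- extension, or no dx key inside it

    # Type subtags are longer than two characters; a two-character key or a
    # one-character singleton ends the run of dx values.
    end = start
    while end < len(subtags) and len(subtags[end]) > 2:
        end += 1

    values = sorted(lowered[start:end])
    return "-".join(subtags[:start] + values + subtags[end:])


print(canonicalize_dx("zh-u-dx-thai-khmr-laoo"))  # zh-u-dx-khmr-laoo-thai
```

Under this reading, options B and E from the description denote the same set of scripts, and E is the alphabetically ordered form.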

Steven R. Loomis 
October 23, 2023 at 5:09 PM

B. zh-u-dx-thai-khmr-laoo

As to the sorting, does not specify it that I could see.

I am going to say LGTM on this overall as to the original scope, and let you respond on the ordering. You can move this to fixed or clone it etc.

Frank Yung-Fong Tang 
October 5, 2023 at 6:50 PM

Would there be any requirement in the canonicalization process to sort the script codes?

So would zh-u-dx-thai-khmr-laoo be turned into zh-u-dx-khmr-laoo-thai,
or stay as zh-u-dx-thai-khmr-laoo?

Fixed

Created July 28, 2023 at 3:24 AM
Updated January 20, 2024 at 12:29 AM
Resolved January 20, 2024 at 12:29 AM