LDML spec: clarify FractionalUCA tertiary weight field

Description

Some of the character-to-collation-element mappings in FractionalUCA.txt sort in a different order than the corresponding mappings in allkeys_CLDR.txt. For example, the allkeys collation elements sort U+FF64 before U+FE11, whereas the FractionalUCA collation elements for the same characters sort U+FF64 after U+FE11. Compare their collation elements in the table below:

 

Character   FractionalUCA.txt   allkeys_CLDR.txt
U+FF64      [07 22, 05, 93]     [0136.0020.0012]
U+FE11      [07 22, 05, 2C]     [0136.0020.0016]

Both U+FF64 and U+FE11 have the same primary and secondary weights but different tertiary weights. The FractionalUCA tertiary weights sort U+FE11 < U+FF64 (because 2C < 93), but allkeys sorts them as U+FE11 > U+FF64 (because 0016 > 0012).

Additionally, in a few cases, the tertiary weights for a pair of characters differ in FractionalUCA but are identical for the same characters in allkeys. The table below shows that the tertiary weights for U+018A and U+0041 are different in FractionalUCA but identical in allkeys. I would expect that if the tertiary weights differ in one table, they should also differ in the other. Because of this discrepancy, the test for this pair fails with a collation element table built from FractionalUCA, but passes with a table built from allkeys.

 

Character   FractionalUCA.txt   allkeys_CLDR.txt
U+018A      [31 12, 05, A0]     [2100.0020.0008]
U+0041      [2A, 05, 9C]        [20A9.0020.0008]

Activity

Markus Scherer
October 13, 2023 at 3:28 PM

FYI: We have collation libraries based on CLDR and its FractionalUCA.txt and tailoring data in C/C++ (ICU4C), Java (ICU4J), Rust (ICU4X), and there is an up-to-date Python wrapper (PyICU).

Markus Scherer
October 13, 2023 at 3:25 PM

You are welcome

It depends. If there is a problem/gap in the CLDR documentation / LDML spec, then this is the right place. I suggest that we keep this ticket open for that. I will rename it.

If there is a problem with a Unicode specification other than CLDR/ICU/ICU4X, for example UTS #10 (the UCA), then you can report it here: https://www.unicode.org/reporting.html

If you just have some questions, then you can send an email to one of our mailing lists: https://www.unicode.org/consortium/distlist.html

Henry Stratmann
October 13, 2023 at 5:32 AM

Thank you for the detailed write-up. I was able to resolve my related failures. This issue can be closed.

The insights you’ve provided were extremely helpful. Is there a preferred place, a mailing list perhaps, where one might go to discuss the Unicode technical reports or suggest clarifications to the documentation?

Markus Scherer
October 13, 2023 at 3:05 AM

The data is fine. The documentation could use some work.

The FractionalUCA tertiary weight is really a combination of two fields with 2 bits and 6 bits respectively. As the documentation says, way too briefly: “The tertiary weight actually consists of two components: the top two bits (0xC0) are used for the case level, and should be masked off where a case level is not used.”

So the tertiary weight 0x93 is really case bits 0x80 + real tertiary 0x13. Now, U+FF64 is the halfwidth ideographic comma, which is not actually uppercase, but there are a couple of other distinctions encoded in the two case bits, and “width variant” is probably one of them. I would have to look at the data generator to remind myself, and it would be good to document it.

Anyway, when you have caseLevel=off and caseFirst=off (which are the default settings), then the case bits are ignored, and FF64 → [07 22, 05, 13] sorts tertiary-before FE11 → [07 22, 05, 2C], as expected.
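The masking described above can be sketched in a few lines of Python. This is only an illustration: the names `CASE_BITS_MASK`, `TERTIARY_MASK`, and `split_tertiary` are made up here and are not ICU or CLDR APIs. The weights are the ones quoted in this ticket.

```python
# Illustrative only: these names are hypothetical, not part of ICU or CLDR tooling.
CASE_BITS_MASK = 0xC0   # top two bits of the FractionalUCA tertiary byte
TERTIARY_MASK = 0x3F    # remaining six bits: the "real" tertiary weight


def split_tertiary(weight: int) -> tuple[int, int]:
    """Return (case_bits, real_tertiary) for a one-byte FractionalUCA tertiary weight."""
    return weight & CASE_BITS_MASK, weight & TERTIARY_MASK


ff64_case, ff64_tertiary = split_tertiary(0x93)   # U+FF64
fe11_case, fe11_tertiary = split_tertiary(0x2C)   # U+FE11

print(hex(ff64_case), hex(ff64_tertiary))   # 0x80 0x13
print(hex(fe11_case), hex(fe11_tertiary))   # 0x0 0x2c
print(ff64_tertiary < fe11_tertiary)        # True: FF64 sorts tertiary-before FE11
```

With the case bits stripped, the FractionalUCA order of the two characters matches the allkeys order (0012 < 0016).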


The tertiary weights are fine, too.

The DUCET allkeys.txt file assigns fixed tertiary weight values (documented in UTS #10) for certain types of characters (e.g., by case or width).

However, for collation the absolute values are mostly irrelevant. Things that are equal must have equal weights, and otherwise only the relative order of weights matters.

For CLDR and ICU, we generate an optimized version of this data.

For each primary weight, we spread the secondary weights that occur in the data so that we get maximal gaps for tailoring secondary distinctions. (The more secondary weights per primary, the smaller the gaps.)

For each combination of primary+secondary weight, we spread the tertiary weights that occur in the data.

In particular, for some primary+secondary combinations, there might be only one non-default (>0002) allkeys fixed tertiary weight, and no other collation element with the default (0002) tertiary weight. Since it will never be compared with another tertiary value of any root-table collation element, we are free to choose a tertiary weight. In such a case, we choose the default weight (05 in FractionalUCA.txt) because that leads to a more compact encoding in ICU.
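The idea of spreading weights while keeping only their relative order might be sketched as follows. This is not the actual CLDR data generator: the function name `spread_weights`, the upper bound `MAX_WEIGHT`, and the even spacing are assumptions made for this illustration. Only the default weight 05 and the lone-weight rule come from the explanation above.

```python
DEFAULT_WEIGHT = 0x05   # default secondary/tertiary weight in FractionalUCA.txt
MAX_WEIGHT = 0x3F       # assumed upper bound (six tertiary bits); the real generator may differ


def spread_weights(fixed_weights: set[int]) -> dict[int, int]:
    """Map each distinct fixed weight to a spread-out weight, preserving relative order.

    Simplification: the lowest weight in the set always receives the default weight.
    """
    ordered = sorted(fixed_weights)
    if len(ordered) == 1:
        # A lone weight is never compared with another weight at this level,
        # so the compact default can be used for it.
        return {ordered[0]: DEFAULT_WEIGHT}
    step = (MAX_WEIGHT - DEFAULT_WEIGHT) // (len(ordered) - 1)
    if step == 0:
        raise ValueError("too many distinct weights for this simple sketch")
    return {w: DEFAULT_WEIGHT + i * step for i, w in enumerate(ordered)}


# Three distinct allkeys tertiary weights under one primary+secondary combination.
spread = spread_weights({0x0002, 0x0012, 0x0016})
print({hex(k): hex(v) for k, v in spread.items()})

# Only one tertiary weight occurs: it is replaced by the default.
print(spread_weights({0x0008}))
```

The absolute output values differ from the input values, but equal inputs stay equal and the ordering is unchanged, which is all that collation needs.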

In your example (with data from current “main”):

We need to distinguish these three characters on tertiary level, but the specific tertiary weights don’t matter.

And:

Again, we need to distinguish among these, and we do, in the expected order. (Remember to strip the case bits again for default settings, so the real tertiary of U+0041 is 0x1C.)

However, we will never compare the tertiaries of U+018A and U+0041 with each other, because collation considers the primary and secondary levels before tertiary, and the primary+secondary CE parts of these two characters differ ([31 12, 05, xx] vs. [2A, 05, xx]).
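A minimal sketch of why those tertiaries are never compared, assuming each character maps to a single collation element and default settings (case bits ignored). The `CE` type and `first_differing_level` function are hypothetical names for this illustration, not ICU APIs. Real collation compares whole strings level by level, which reduces to this for single-CE strings.

```python
from typing import NamedTuple


class CE(NamedTuple):
    primary: bytes    # variable-length fractional primary weight
    secondary: int
    tertiary: int     # one byte, including the two case bits


def first_differing_level(a: CE, b: CE) -> str:
    """Return the name of the first level at which two single CEs differ."""
    if a.primary != b.primary:
        return "primary"
    if a.secondary != b.secondary:
        return "secondary"
    if (a.tertiary & 0x3F) != (b.tertiary & 0x3F):   # case bits masked off
        return "tertiary"
    return "equal"


u018a = CE(bytes([0x31, 0x12]), 0x05, 0xA0)
u0041 = CE(bytes([0x2A]), 0x05, 0x9C)
ff64 = CE(bytes([0x07, 0x22]), 0x05, 0x93)
fe11 = CE(bytes([0x07, 0x22]), 0x05, 0x2C)

print(first_differing_level(u018a, u0041))   # primary
print(first_differing_level(ff64, fe11))     # tertiary
```

For U+018A and U+0041 the comparison is decided at the primary level, so their tertiary weights play no role. For U+FF64 and U+FE11 the primary and secondary weights are equal, so the tertiary weights decide.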

Also, ICU uses the FractionalUCA.txt data file and passes the (CLDR versions of the) UCA conformance test files.


I recommend that we add details of the FractionalUCA.txt tertiary weight values to the documentation. I can try to find time to do so.

Details


Created October 13, 2023 at 1:59 AM
Updated October 13, 2023 at 3:29 PM