In CLDR 30/31, there was a serious regression in the zh stroke collation. For example, the following common characters/radicals were missing in the stroke collation:
This is related to changes for the following tickets:
: UCA 9, r12662
: BRS update tasks, r12930
There is likely some problem in the tooling or the way it was run.
I am checking to see whether the problem is still present in CLDR 32 data, updated per : Unicode 10
This fix is for tooling, and is too much for CLDR 32 at this point. In CLDR 32 we will have to revert to the CLDR 29 stroke collation to address the problem, per :. The real fix in tooling is per this bug, which is moving to CLDR 33.
The problem was introduced in r12662
The problem is due to change r 1047 in the unicodetools project, part of the changes for : "UCA 9", but of course the unicodetools portion of the changes does not show in the review link for that ticket. The relevant change is in unicodetools/trunk/unicodetools/org/unicode/draft/GenerateUnihanCollators.java, in RSComparator.compare (line 1429); this is used by StrokeComparator.compare to provide a result when two characters have the same stroke count. In the version used for CLDR 29 it looked like this:
That is, if RsInfo.getSortOrder returned the same value for both characters, the method used the codepoint to distinguish them. In the version used for CLDR 30 and later it looks like this:
If getRSLongOrder returns the same result for both characters, there is no longer a fallback to a difference based on code point order.
In showSorting, in the first loop over Strings from unicodeMap, at line 875, rsSorted.add(s) is called to add strings. However, a string is not added if the comparator indicates it is equal to something already in rsSorted, and this is what I now see happening for the characters that have gone missing. For example:
rsSorted.add is called for 2-stroke \u2E86 ⺆, adds it successfully, then
rsSorted.add is called for 2-stroke \u5182 冂 (the more important char), does NOT add because the comparator treats it as equal to \u2E86
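The behavior described above can be reproduced in isolation. The following is only a minimal sketch, assuming rsSorted behaves like a java.util.TreeSet built with the comparator; the class name StrokeTieDemo and the strokeCount helper are hypothetical stand-ins, not code from GenerateUnihanCollators:

```java
import java.util.Comparator;
import java.util.TreeSet;

class StrokeTieDemo {
    // Hypothetical stand-in for the real stroke lookup: both characters
    // in this example are treated as 2-stroke characters.
    static int strokeCount(String s) {
        return 2;
    }

    // Comparator with no tie-breaker: two different strings with the same
    // stroke count compare as equal (compare returns 0).
    static final Comparator<String> STROKES_ONLY =
            (a, b) -> Integer.compare(strokeCount(a), strokeCount(b));

    static TreeSet<String> build() {
        TreeSet<String> rsSorted = new TreeSet<>(STROKES_ONLY);
        rsSorted.add("\u2E86"); // added
        rsSorted.add("\u5182"); // compares equal to \u2E86, so the set ignores it
        return rsSorted;
    }

    public static void main(String[] args) {
        TreeSet<String> rsSorted = build();
        System.out.println(rsSorted.size());
        System.out.println(rsSorted.first().equals("\u2E86"));
    }
}
```

A TreeSet decides membership using only its comparator, never equals(), so any pair of strings for which compare returns 0 collapses to whichever one was added first.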
And in fact, in the GenerateUnihanCollators console log, after it writes out the strokeT files, I see output like the following. I am not sure what this is supposed to indicate, but the middle character (at least) in each of these groups is not getting added to rsSorted, and thus is not written out as part of the collation:
So: It seems like a reasonable fix is to restore the code point comparison as a fallback in RSComparator.compare, and THEN run the tools to generate the updated Unicode 11 collation and transform data for CLDR 33.1.
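As a rough illustration of the proposed fallback (not the actual RSComparator code or the committed diff), a comparator can break ties on code point so that distinct characters never compare as equal. The class name CodePointFallbackDemo and the sortKey helper are hypothetical; sortKey merely stands in for a lookup like getRSLongOrder that returns the same value for both characters:

```java
import java.util.Comparator;
import java.util.TreeSet;

class CodePointFallbackDemo {
    // Hypothetical stand-in for a sort-key lookup that returns the same
    // value for two different characters.
    static long sortKey(String s) {
        return 2L;
    }

    // Primary comparison on the sort key, then fall back to the first
    // code point so that distinct characters are never treated as equal.
    static final Comparator<String> WITH_FALLBACK = (a, b) -> {
        int result = Long.compare(sortKey(a), sortKey(b));
        if (result != 0) {
            return result;
        }
        return Integer.compare(a.codePointAt(0), b.codePointAt(0));
    };

    static TreeSet<String> build() {
        TreeSet<String> rsSorted = new TreeSet<>(WITH_FALLBACK);
        rsSorted.add("\u2E86");
        rsSorted.add("\u5182"); // kept: the code point fallback distinguishes it
        return rsSorted;
    }

    public static void main(String[] args) {
        System.out.println(build().size());
    }
}
```

With the tie-breaker in place both characters stay in the set, ordered by code point when the primary key is identical.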
I ran this by Markus, who suggested that I make the change, verify that it fixes the problem, and commit the unicodetools change, but not the updated data (replacing the current stroke data which is from CLDR 29 and has the characters that were missing in CLDR 30). He will run all of the tools for the Unicode 11 data update.
OK, I made the change as r 1468 in the unicodetools project. The diff is:
And running GenerateUnihanCollators with this change:
It now generates the zh stroke collations (the strokeT files) with the characters that were formerly missing
It also adds characters to the pinyin and unihan collations, and to the generated kMandarin.txt and kTotalStrokes.txt files
And it eliminates all of the GenerateUnihanCollators console log output that had lines prefixed with "unihan: ", e.g.