Fixed
Assignee Markus Scherer
Reporter Markus Scherer
Created June 28, 2018 at 5:19 PM
Updated October 3, 2018 at 10:54 PM
Resolved July 1, 2018 at 8:50 PM
Collate Digits As Numbers (CODAN; UCOL_NUMERIC_COLLATION, setNumericCollation()) generates primary weights with the same lead byte as a regular digit zero. The problem is that there are non-decimal-digit characters that share that same single-byte primary weight, for example U+24EA CIRCLED DIGIT ZERO. As a result, U+24EA's primary is a prefix of the primary of all CODAN sequences.
For example, try to sort the following lines with a CODAN Collator (e.g. via the online ICU collation demo):
0
1
10
100
1000
1000000000000
\u24EA?
\u24EA\u4E00
The "⓪?" sorts before all of the CODAN numbers listed here, while the "⓪一" sorts after them.
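The effect can be reproduced with a toy model of primary sort keys. This is only a sketch: the byte values and the codan_primary() encoding are hypothetical and are not ICU's actual weights or number format. The only property carried over from the description above is that U+24EA's single-byte primary equals the lead byte of every CODAN primary.

```python
# Hypothetical byte values, chosen only to illustrate the ordering problem.
DIGIT_LEAD = 0x12  # assumed: lead byte shared by digit zero, U+24EA and CODAN
QUESTION = 0x05    # assumed: '?' (punctuation) has a low primary weight
HAN_YI = 0xFB      # assumed: U+4E00 has a high primary weight

# Single-byte primaries for the non-number characters in the test data.
CHAR_PRIMARY = {
    "\u24EA": bytes([DIGIT_LEAD]),
    "?": bytes([QUESTION]),
    "\u4E00": bytes([HAN_YI]),
}


def codan_primary(number: str) -> bytes:
    """Toy CODAN primary: lead byte, a length byte, then one byte per digit."""
    digits = number.lstrip("0") or "0"
    body = [0x30 + int(d) for d in digits]
    return bytes([DIGIT_LEAD, 0x20 + len(digits)] + body)


def primary_key(line: str) -> bytes:
    """Concatenate primary weights; secondary and tertiary levels are ignored."""
    if line.isascii() and line.isdigit():
        return codan_primary(line)
    return b"".join(CHAR_PRIMARY[ch] for ch in line)


lines = ["0", "1", "10", "100", "1000", "1000000000000",
         "\u24EA?", "\u24EA\u4E00"]
ordered = sorted(lines, key=primary_key)
print(ordered)
```

In this model the key for "⓪?" starts with the shared lead byte followed by a low byte, and the key for "⓪一" starts with the same lead byte followed by a high byte, so the two lines land on opposite sides of the numbers.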
We need to use a separate primary lead byte for CODAN. We need to enter it into the "inverse UCA" table to prevent tailoring it, and therefore we need another lead byte as a gap. For example, the new CODAN lead byte could be followed by one new gap byte followed by the byte for decimal digit zero.
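A minimal sketch of the proposed layout, again with hypothetical byte values and a toy number encoding rather than ICU's real ones: CODAN gets its own lead byte, followed by one gap byte, followed by the byte for digit zero. No character's primary starts with the CODAN lead byte, so no primary is a prefix of a CODAN primary.

```python
# Hypothetical layout: CODAN lead byte, then a gap byte, then digit zero.
CODAN_LEAD = 0x10  # assumed: new lead byte reserved for CODAN numbers
GAP = 0x11         # assumed: unused gap byte, kept free of any weights
DIGIT_ZERO = 0x12  # assumed: single-byte primary of digit zero and U+24EA
QUESTION = 0x05    # assumed: '?' has a low primary weight
HAN_YI = 0xFB      # assumed: U+4E00 has a high primary weight

CHAR_PRIMARY = {
    "\u24EA": bytes([DIGIT_ZERO]),
    "?": bytes([QUESTION]),
    "\u4E00": bytes([HAN_YI]),
}


def codan_primary(number: str) -> bytes:
    """Toy CODAN primary that starts with the dedicated CODAN lead byte."""
    digits = number.lstrip("0") or "0"
    body = [0x30 + int(d) for d in digits]
    return bytes([CODAN_LEAD, 0x20 + len(digits)] + body)


def primary_key(line: str) -> bytes:
    """Concatenate primary weights; secondary and tertiary levels are ignored."""
    if line.isascii() and line.isdigit():
        return codan_primary(line)
    return b"".join(CHAR_PRIMARY[ch] for ch in line)


lines = ["0", "1", "10", "100", "1000", "1000000000000",
         "\u24EA?", "\u24EA\u4E00"]
ordered = sorted(lines, key=primary_key)
print(ordered)
```

With the separate lead byte, both "⓪" lines sort on the same side of the numbers, and the numbers stay contiguous and in numeric order.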
So we need to modify FractionalUCA.txt and coordinate with the CODAN runtime code.
We could add new FractionalUCA syntax to specify the CODAN byte.
It might work, and be simpler, to combine this with adding first-in-script entries to FractionalUCA.txt: make the first-digits weight a single byte and use that for CODAN. That would minimize the use of primary weight lead bytes.