CODAN generates ill-formed primary weights

Description

COllate Digits As Numbers (UCOL_NUMERIC_COLLATION, setNumericCollation()) generates primary weights with the same lead byte as a regular digit zero. The problem is that there are non-decimal-digit characters that share that same single-byte primary weight, for example U+24EA CIRCLED DIGIT ZERO. As a result, U+24EA's primary is a prefix of the primary of all CODAN sequences.

For example, try to sort the following lines with a CODAN Collator (e.g. via the online ICU collation demo):

0
1
10
100
1000
1000000000000
\u24EA?
\u24EA\u4E00

The "⓪?" sorts before all of the CODAN numbers listed here, while the "⓪一" sorts after them.
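The ordering can be reproduced with a toy model. This is only a sketch: every byte value, the PRIMARY table, and the codan_primary() encoding are assumptions made up for illustration, not the weights or the encoding ICU actually uses. The only property carried over from the report is that U+24EA has a single-byte primary equal to the lead byte of every CODAN primary.

```python
# All byte values below are hypothetical, chosen only to make the overlap visible.
DIGIT_LEAD = 0x12  # assumed lead byte shared by digit zero and CODAN

PRIMARY = {
    "?": bytes([0x05]),             # punctuation: low primary
    "\u24EA": bytes([DIGIT_LEAD]),  # CIRCLED DIGIT ZERO: single-byte primary
    "\u4E00": bytes([0xE0, 0x02]),  # Han: high primary
}


def codan_primary(digits: str) -> bytes:
    """Toy numeric primary: lead byte, length byte, then one byte per digit."""
    value = digits.lstrip("0") or "0"
    return bytes([DIGIT_LEAD, 0x80 + len(value)]) + bytes(0x10 + int(d) for d in value)


def sort_key(text: str) -> bytes:
    """Concatenate primaries, treating each run of ASCII digits as one number."""
    key = bytearray()
    i = 0
    while i < len(text):
        if text[i] in "0123456789":
            j = i
            while j < len(text) and text[j] in "0123456789":
                j += 1
            key += codan_primary(text[i:j])
            i = j
        else:
            key += PRIMARY[text[i]]
            i += 1
    return bytes(key)


lines = ["0", "1", "10", "100", "1000", "1000000000000", "\u24EA?", "\u24EA\u4E00"]
ordered = sorted(lines, key=sort_key)
print(ordered)
```

In this model the second byte of each key decides the order: the primary of "?" falls below the toy length byte of every number and the Han lead byte falls above it, so the two U+24EA strings end up on opposite sides of the numbers.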

We need to use a separate primary lead byte for CODAN. We also need to enter it into the "inverse UCA" table to prevent tailoring it, and therefore we need another lead byte as a gap. For example, the new CODAN lead byte could be followed by one new gap byte followed by the byte for decimal digit zero.
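A minimal sketch of the property the proposed layout is meant to restore: no primary weight may be a proper prefix of another. The helper prefix_overlaps() and all byte values are hypothetical; only the ordering (CODAN lead byte, then a gap byte, then the digit-zero byte) follows the proposal above.

```python
from itertools import permutations


def prefix_overlaps(weights):
    """Return (shorter, longer) pairs where one weight is a proper prefix of the other."""
    unique = sorted(set(weights))
    return [(a, b) for a, b in permutations(unique, 2) if b.startswith(a)]


# Hypothetical lead bytes, laid out as CODAN < gap < digit zero.
CODAN_LEAD = 0x10
GAP_LEAD = 0x11   # reserved as a gap, never assigned to a character
ZERO_LEAD = 0x12

circled_zero = bytes([ZERO_LEAD])  # single-byte primary, as for U+24EA

# Current behaviour: numeric primaries reuse the digit-zero lead byte.
shared = [
    circled_zero,
    bytes([ZERO_LEAD, 0x81, 0x10]),
    bytes([ZERO_LEAD, 0x82, 0x11, 0x10]),
]

# Proposed behaviour: numeric primaries get their own lead byte.
separate = [
    circled_zero,
    bytes([CODAN_LEAD, 0x81, 0x10]),
    bytes([CODAN_LEAD, 0x82, 0x11, 0x10]),
]

print(prefix_overlaps(shared))    # two overlapping pairs
print(prefix_overlaps(separate))  # no overlaps
```

With a dedicated lead byte the numeric primaries form their own contiguous block, so a single-byte digit primary can no longer be a prefix of any of them.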

So we need to modify FractionalUCA.txt and coordinate with the CODAN runtime code.

We could add new FractionalUCA syntax to specify the CODAN byte.

It might work, and be simpler, to combine this with adding first-in-script entries to the FractionalUCA, make the first-digits weight be a single byte, and use that for CODAN. That would minimize the use of primary weight lead bytes.

Activity

UnicodeBot 
June 30, 2018 at 11:50 PM

Trac Comment 8 by —2014-03-14T05:42:07.240Z

Tested, works correctly.

UnicodeBot 
June 30, 2018 at 11:50 PM

Trac Comment 6 by —2013-09-13T18:13:38.649Z

This is done in my collv2 branch and in the FractionalUCA.txt it uses.

UnicodeBot 
June 30, 2018 at 11:50 PM

Trac Comment 4 by —2012-12-30T17:57:48.702Z

Another problem with sharing a first-in-digits-group primary: We now plan to enter the script-first primaries as regular mappings from strings (U+FDD1 + script sample character) into the collation table, for use in AlphabeticIndex (see CldrBug:5129). If numeric sorting uses the same lead byte as the first-digit string, then we are back to illegal prefix overlaps between that string and whole-number CEs.

We need to reserve a separate lead byte for numeric sorting, one that is in the invuca table but not by itself reachable via a string. We can add it to FractionalUCA.txt with a string like U+FDD0 + '4' since genuca enters all strings starting with FDD0 only into the invuca table.

UnicodeBot 
June 30, 2018 at 11:50 PM

Trac Comment 3 by —2012-08-31T21:29:17.639Z

Potential problem with sharing a first-in-digits-group primary: If we allow tailoring secondary (or weaker) after the first-digit value then that tailored CE would still render CODAN CEs invalid.

The simplest might be to not allow tailoring to the first-in-script CEs. (Don't add syntax to do so.)

Fixed

Created June 28, 2018 at 5:19 PM
Updated October 3, 2018 at 10:54 PM
Resolved July 1, 2018 at 8:50 PM