CODAN generates ill-formed primary weights

Description

COllate Digits As Numbers (UCOL_NUMERIC_COLLATION, setNumericCollation()) generates primary weights with the same lead byte as a regular digit zero. The problem is that there are non-decimal-digit characters that share that same single-byte primary weight, for example U+24EA CIRCLED DIGIT ZERO. As a result, U+24EA's primary is a prefix of the primary of all CODAN sequences.

For example, try to sort the following lines with a CODAN Collator (e.g. via the online ICU collation demo):

0
1
10
100
1000
1000000000000
\u24EA?
\u24EA\u4E00

The "⓪?" sorts before all of the CODAN numbers listed here, while the "⓪一" sorts after them.
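
The ordering above can be reproduced with a toy model of primary sort keys. This is only a sketch: every byte value, the `PRIMARIES` table, and the layout used by `numeric_primary` (lead byte, digit-count byte, one byte per digit) are assumptions made for illustration, not the real FractionalUCA weights or the actual CODAN encoding.

```python
# Toy model of primary sort keys. Every byte value here is hypothetical;
# none of them is taken from the real FractionalUCA.txt.
DIGIT_LEAD = 0x28  # assumed lead byte shared by digit zero and U+24EA

# Assumed primaries for the non-digit characters used in the example.
PRIMARIES = {
    "\u24EA": bytes([DIGIT_LEAD]),  # CIRCLED DIGIT ZERO: single-byte primary
    "?": bytes([0x07]),             # punctuation: low primary
    "\u4E00": bytes([0xE0, 0x02]),  # Han ideograph: high primary
}

ASCII_DIGITS = "0123456789"


def numeric_primary(digits, lead):
    """Toy CODAN weight: lead byte, digit-count byte, one byte per digit."""
    value = digits.lstrip("0") or "0"
    return bytes([lead, 0x80 + len(value)]) + bytes(0x10 + int(d) for d in value)


def primary_key(text, codan_lead=DIGIT_LEAD):
    """Concatenate primaries, collapsing each ASCII digit run into one number."""
    key = bytearray()
    i = 0
    while i < len(text):
        if text[i] in ASCII_DIGITS:
            j = i
            while j < len(text) and text[j] in ASCII_DIGITS:
                j += 1
            key += numeric_primary(text[i:j], codan_lead)
            i = j
        else:
            key += PRIMARIES[text[i]]
            i += 1
    return bytes(key)


LINES = ["0", "1", "10", "100", "1000", "1000000000000", "\u24EA?", "\u24EA\u4E00"]


def sort_lines(codan_lead=DIGIT_LEAD):
    """Sort the example lines by their toy primary keys."""
    return sorted(LINES, key=lambda s: primary_key(s, codan_lead))


if __name__ == "__main__":
    for line in sort_lines():
        print(line.encode("unicode_escape").decode("ascii"))
```

With the shared lead byte, the second key byte decides where each ⓪ line lands relative to the numbers, so the two lines end up on opposite sides of them. Calling `sort_lines(0x26)`, with 0x26 standing in for a hypothetical separate CODAN lead byte, places all numbers before both ⓪ lines in this model.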

We need to use a separate primary lead byte for CODAN. We need to enter it into the "inverse UCA" table to prevent tailoring it, and therefore we need another lead byte as a gap. For example, the new CODAN lead byte could be followed by one new gap byte, followed by the byte for decimal digit zero.
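
The well-formedness condition behind this proposal is that no primary weight may be a proper prefix of another. A minimal sketch of such a check follows. The byte values for `CODAN_LEAD`, `GAP`, and `DIGIT_ZERO`, and the weight bytes for the number, are hypothetical and only illustrate the proposed layout; `prefix_overlaps` is an illustrative helper, not an ICU function.

```python
from itertools import permutations


def prefix_overlaps(weights):
    """Return (name_a, name_b) pairs where a's weight is a proper prefix of b's."""
    return [
        (a, b)
        for (a, wa), (b, wb) in permutations(weights.items(), 2)
        if len(wa) < len(wb) and wb.startswith(wa)
    ]


# Hypothetical byte values, chosen only to show the layout described above:
# CODAN lead byte, then one gap byte, then the byte for decimal digit zero.
CODAN_LEAD, GAP, DIGIT_ZERO = 0x26, 0x27, 0x28
assert CODAN_LEAD < GAP < DIGIT_ZERO  # the gap byte is left unassigned

# Current situation: the numeric weight starts with the digit-zero byte.
shared = {
    "U+24EA": bytes([DIGIT_ZERO]),
    "number 10": bytes([DIGIT_ZERO, 0x82, 0x11, 0x10]),
}

# Proposed situation: the numeric weight starts with its own lead byte.
separate = {
    "U+24EA": bytes([DIGIT_ZERO]),
    "number 10": bytes([CODAN_LEAD, 0x82, 0x11, 0x10]),
}

if __name__ == "__main__":
    print(prefix_overlaps(shared))
    print(prefix_overlaps(separate))
```

In this model the shared layout reports U+24EA as a prefix of the numeric weight, while the separate layout reports no overlaps.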

So we need to modify FractionalUCA.txt and coordinate with the CODAN runtime code.

We could add new FractionalUCA syntax to specify the CODAN byte.

It might work, and be simpler, to combine this with adding first-in-script entries to the FractionalUCA, make the first-digits weight be a single byte, and use that for CODAN. That would minimize the use of primary weight lead bytes.

Linked work items

relates to ICU-9552: numeric collation (CODAN) should go before digit 0

Activity

UnicodeBot
June 30, 2018 at 11:50 PM

Trac Comment 8 by @Peter Edberg—2014-03-14T05:42:07.240Z

Tested, works correctly.

UnicodeBot
June 30, 2018 at 11:50 PM

Trac Comment 6 by @Markus Scherer—2013-09-13T18:13:38.649Z

This is done in my collv2 branch and in the FractionalUCA.txt it uses.

UnicodeBot
June 30, 2018 at 11:50 PM

Trac Comment 4 by @Markus Scherer—2012-12-30T17:57:48.702Z

Another problem with sharing a first-in-digits-group primary: We now plan to enter the script-first primaries as regular mappings from strings (U+FDD1 + script sample character) into the collation table, for use in AlphabeticIndex (see CldrBug:5129). If numeric sorting uses the same lead byte as the first-digit string, then we are back to illegal prefix overlaps between that string and whole-number CEs.

We need to reserve a separate lead byte for numeric sorting, one that is in the invuca table but not by itself reachable via a string. We can add it to FractionalUCA.txt with a string like U+FDD0 + '4' since genuca enters all strings starting with FDD0 only into the invuca table.

UnicodeBot
June 30, 2018 at 11:50 PM

Trac Comment 3 by @Markus Scherer—2012-08-31T21:29:17.639Z

Potential problem with sharing a first-in-digits-group primary: If we allow tailoring a secondary (or weaker) difference after the first-digit value, then that tailored CE would still render CODAN CEs invalid.

The simplest might be to not allow tailoring to the first-in-script CEs. (Don't add syntax to do so.)

Fixed

Details

Assignee: Markus Scherer

Reporter: Markus Scherer

Components: collation

Labels: fixed-collv2

Priority: major

Time Needed: Hours

Fix versions: 53.1 (release)

Created June 28, 2018 at 5:19 PM

Updated October 3, 2018 at 10:54 PM

Resolved July 1, 2018 at 8:50 PM
