Collation bug in discontiguous contractions with some Tibetan vowel signs

Description

Assume a contraction of AB mapping to some collation element. Discontiguous contraction handling happens when a combining mark m occurs between A and B, e.g., AmB. It is done when ccc(B)!=0 and ccc(B)!=ccc(m).
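As a rough illustration of the condition as stated in this ticket (ccc(B)!=0 and ccc(B)!=ccc(m)), here is a minimal Python sketch. The function name `discontiguous_allowed` is hypothetical and this is not ICU code; it only evaluates the stated test using Python's `unicodedata.combining`.

```python
import unicodedata

def ccc(ch: str) -> int:
    """Canonical combining class of a single character."""
    return unicodedata.combining(ch)

def discontiguous_allowed(m: str, b: str) -> bool:
    """Hypothetical helper: the ticket's simplified condition for
    matching A+B across an intervening mark m."""
    return ccc(b) != 0 and ccc(b) != ccc(m)

# U+0334 (ccc=1) between A and U+0F80 (ccc=130): B can be considered.
print(discontiguous_allowed("\u0334", "\u0F80"))
# U+0F81 has ccc=0, so it is never considered under this condition.
print(discontiguous_allowed("\u0334", "\u0F81"))
```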

Mostly, this works. However, it collides with one of our optimizations: UCA is specified to work on NFD text, but when an ICU Collator gets FCD input it just uses it, with data that has undergone canonical closure.

*The problem* is that there are three Tibetan vowel signs (U+0F73, U+0F75, U+0F81) with ccc==0 but whose decompositions have ccc!=0: `[[:ccc=0:]& ]`

Consider UCA 6.0's CollationTest_*.txt files with lines like `0FB2 0334 0F81`: The line passes the FCD check, but U+0F81 has ccc=0, which prevents the discontiguous contraction of 0FB2+0F81 from even being considered, so we get the wrong result.
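The test line can be checked with a small Python sketch. The `is_fcd` function below is a simplified FCD check written for illustration (an assumption, not ICU's implementation): a string passes if, for each character, the lead combining class of its decomposition is zero or not less than the trail combining class of the previous character's decomposition.

```python
import unicodedata

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

def lccc(ch: str) -> int:
    """ccc of the first character of ch's canonical decomposition."""
    return unicodedata.combining(nfd(ch)[0])

def tccc(ch: str) -> int:
    """ccc of the last character of ch's canonical decomposition."""
    return unicodedata.combining(nfd(ch)[-1])

def is_fcd(s: str) -> bool:
    """Simplified FCD check (illustrative only)."""
    prev_tccc = 0
    for ch in s:
        lead = lccc(ch)
        if lead != 0 and lead < prev_tccc:
            return False
        prev_tccc = tccc(ch)
    return True

line = "\u0FB2\u0334\u0F81"
print(is_fcd(line))                           # True: passes the FCD check
print(line == nfd(line))                      # False: but it is not NFD
print([f"{ord(c):04X}" for c in nfd(line)])   # ['0FB2', '0334', '0F71', '0F80']
print(unicodedata.combining("\u0F81"))        # 0
```

In the NFD form the last character is U+0F80 with a non-zero ccc, whereas the FCD input ends in U+0F81 with ccc=0.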

We might have a related problem when the intervening combining mark m is one of these three Tibetan vowels: We probably don't consider discontiguous contraction across one of these either.

A possible fix might check for the lccc value of m and B rather than their ccc values. (Not discussed or tested.)
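To make the suggestion concrete, the following Python sketch swaps lccc (the ccc of the first character of the canonical decomposition) into the ticket's simplified condition. Like the suggestion itself, it is untested against ICU; the function names are hypothetical.

```python
import unicodedata

def ccc(ch: str) -> int:
    return unicodedata.combining(ch)

def lccc(ch: str) -> int:
    """Lead combining class: ccc of the first character of NFD(ch)."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0])

def allowed_by_ccc(m: str, b: str) -> bool:
    """The condition as described in this ticket."""
    return ccc(b) != 0 and ccc(b) != ccc(m)

def allowed_by_lccc(m: str, b: str) -> bool:
    """Hypothetical variant: the same test using lccc instead of ccc."""
    return lccc(b) != 0 and lccc(b) != lccc(m)

m, b = "\u0334", "\u0F81"
print(allowed_by_ccc(m, b))   # False: ccc(U+0F81) is 0
print(allowed_by_lccc(m, b))  # True: lccc(U+0F81) is 129, lccc(U+0334) is 1
```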

Activity

TracBot
July 1, 2018, 9:28 AM
Trac Comment 7 by —2013-02-12T21:21:06.382Z

The following is the same as a comment on CldrBug:5667.

We might need to add U+0344 to this list.

Richard Wordingham provided an example today on the unicode list ("FCD and Collation") where U+0344 COMBINING GREEK DIALYTIKA TONOS, which is equivalent to <0308 0301> (both ccc=230), can also cause incorrect results despite FCD input.

Consider a tailoring with contractions 03B1+0308 and 0301+0345.

Assume a builder that adds further contractions to cover overlaps between contractions and decompositions. It would add 03B1+0344, 03B1+0344+0345 and 0344+0345.

Input string: 03B1 0359 0344 0345 (with U+0359 COMBINING ASTERISK BELOW as an example for any character with ccc<230) is processed via discontiguous-contraction matching as 03B1+0344+0345, 0359

... but when processing the NFD form 03B1 0359 0308 0301 0345 we get 03B1+0308, 0359, 0301+0345 – note the different position of the 0359.
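The two results can be reproduced with a toy model of discontiguous-contraction matching. This is a simplified sketch, not ICU's implementation: the `segment` function and the plain-string contraction set are assumptions made for illustration. It takes the longest contiguous match, then tries to append following non-starters that are not blocked by a skipped mark with an equal or higher ccc.

```python
import unicodedata

def ccc(ch: str) -> int:
    return unicodedata.combining(ch)

def segment(text: str, contractions: set) -> list:
    """Toy matcher: split text into contraction matches and single characters."""
    chars, out = list(text), []
    while chars:
        n = 1
        for k in range(len(chars), 1, -1):   # longest contiguous match
            if "".join(chars[:k]) in contractions:
                n = k
                break
        s, rest = "".join(chars[:n]), chars[n:]
        skipped, i = [], 0
        while i < len(rest) and ccc(rest[i]) != 0:
            c = rest[i]
            blocked = any(ccc(x) >= ccc(c) for x in skipped)
            if not blocked and s + c in contractions:
                s += c          # discontiguous match: absorb c
                del rest[i]
            else:
                skipped.append(c)
                i += 1
        out.append(s)
        chars = rest
    return out

# Tailoring contractions plus the overlap contractions a builder would add.
contractions = {"\u03B1\u0308", "\u0301\u0345",
                "\u03B1\u0344", "\u03B1\u0344\u0345", "\u0344\u0345"}

def show(segs):
    return ["+".join(f"{ord(c):04X}" for c in s) for s in segs]

fcd = "\u03B1\u0359\u0344\u0345"
print(show(segment(fcd, contractions)))
# ['03B1+0344+0345', '0359']
print(show(segment(unicodedata.normalize("NFD", fcd), contractions)))
# ['03B1+0308', '0359', '0301+0345']
```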

The full set of problematic characters appears to be ` & ` (link to Mark's demo) == `[\u0344\u0F73\u0F75\u0F81]`.
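One criterion that picks out these four characters is "the canonical decomposition has more than one character and starts with a character whose ccc is non-zero". That criterion is an assumption made for this sketch and is not necessarily the expression used in the demo. The result also depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

def multi_char_decomposition_with_nonzero_lccc() -> list:
    """Code points whose NFD form is longer than one character and
    begins with a character that has a non-zero combining class."""
    found = []
    for cp in range(0x110000):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        if not mapping or mapping.startswith("<"):
            continue  # no canonical decomposition mapping
        nfd = unicodedata.normalize("NFD", ch)
        if len(nfd) > 1 and unicodedata.combining(nfd[0]) != 0:
            found.append(cp)
    return found

print([f"U+{cp:04X}" for cp in multi_char_decomposition_with_nonzero_lccc()])
```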

TracBot
July 1, 2018, 9:28 AM
Trac Comment 8 by —2013-11-09T00:02:58.683Z

The "collv2" branch implementation decomposes the Tibetan composite vowels in the FCD check, even if the segment otherwise passes the check.

I am back to thinking that U+0344 is different, and should be handled by adding overlap contractions as needed.

Fixed

Assignee

Markus Scherer

Reporter

Markus Scherer

Components

Labels

Reviewer

None

Priority

medium

Time Needed

Days

Fix versions