Collation bug in discontiguous contractions with some Tibetan vowel signs


Assume a contraction of AB mapping to some collation element. Discontiguous contraction handling happens when a combining mark m occurs between A and B, e.g., AmB. It is done when ccc(B)!=0 and ccc(B)!=ccc(m).
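
For illustration only, the condition can be written as a small predicate. This is a sketch using Python's `unicodedata` module rather than ICU; the function name `may_skip` is made up for this example, and it models just the ccc test described above, not the full matching loop.

```python
import unicodedata

def may_skip(m: str, b: str) -> bool:
    """Hypothetical helper: can B still be matched across the intervening mark m?

    Mirrors the condition in the description: ccc(B) != 0 and ccc(B) != ccc(m).
    """
    ccc_m = unicodedata.combining(m)
    ccc_b = unicodedata.combining(b)
    return ccc_b != 0 and ccc_b != ccc_m

# m = U+0334 (ccc=1), B = U+0F71 (ccc=129): B is a candidate.
print(may_skip("\u0334", "\u0F71"))
# m = U+0301, B = U+0300 (both ccc=230): m blocks B.
print(may_skip("\u0301", "\u0300"))
# B = "b" is a starter (ccc=0): never considered.
print(may_skip("\u0334", "b"))
```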

Mostly, this works. However, it collides with one of our optimizations: UCA is specified to work on NFD text, but when an ICU Collator gets FCD input it uses that input as is, without normalizing it, and relies on data that has undergone canonical closure.

*The problem* is that there are three Tibetan vowel signs with ccc==0 but whose decompositions have ccc!=0: U+0F73, U+0F75 and U+0F81 (`[[:ccc=0:]&…`).
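
One way to look for such characters outside ICU is to scan all code points with Python's `unicodedata`. This is a sketch: the helper names `lccc` and `ccc0_with_nonzero_lead` are invented here, and the result depends on the Unicode version bundled with the Python interpreter. I would expect the result to include the three Tibetan vowel signs.

```python
import sys
import unicodedata

def lccc(ch: str) -> int:
    """Lead combining class: ccc of the first code point of ch's NFD form."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0])

def ccc0_with_nonzero_lead() -> list[int]:
    """Code points with ccc == 0 whose decomposition starts with ccc != 0."""
    found = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        if unicodedata.combining(ch) == 0 and lccc(ch) != 0:
            found.append(cp)
    return found

if __name__ == "__main__":
    for cp in ccc0_with_nonzero_lead():
        print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```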


Consider UCA 6.0's CollationTest_*.txt files with lines like `0FB2 0334 0F81`: The line passes the FCD check, but U+0F81 has ccc=0, which prevents the discontiguous contraction of 0FB2+0F81 from even being considered, so we get the wrong result.
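
To make the test line concrete, here is a rough FCD check in Python (a sketch, not ICU's implementation; `lead_trail_ccc` and `is_fcd` are names invented for this example). It shows that the line is accepted as FCD, while its NFD form exposes U+0F71 and U+0F80 with non-zero ccc after the U+0334.

```python
import unicodedata

def lead_trail_ccc(ch: str) -> tuple[int, int]:
    """ccc of the first and last code points of ch's NFD form."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])

def is_fcd(text: str) -> bool:
    """Simplified FCD test: each non-zero lead ccc must be >= the previous trail ccc."""
    prev_trail = 0
    for ch in text:
        lead, trail = lead_trail_ccc(ch)
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True

def with_ccc(text: str) -> list[str]:
    """Format each code point as HEX:ccc for inspection."""
    return [f"{ord(c):04X}:{unicodedata.combining(c)}" for c in text]

line = "\u0FB2\u0334\u0F81"
nfd = unicodedata.normalize("NFD", line)
print(is_fcd(line))    # the test line is accepted without normalization
print(with_ccc(line))  # U+0F81 shows up with ccc 0
print(with_ccc(nfd))   # the NFD form ends in U+0F71, U+0F80
```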

We might have a related problem when the intervening combining mark m is one of these three Tibetan vowels: We probably don't consider discontiguous contraction across one of these either.

A possible fix might check the lccc values of m and B (the ccc of the first code point of each one's decomposition) rather than their ccc values. (Not discussed or tested.)
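
As a sketch of that idea only (untested against ICU, with helper names invented here), the two variants of the condition can be compared in Python. With lccc, U+0F81 after U+0334 would become a candidate for discontiguous matching.

```python
import unicodedata

def lccc(ch: str) -> int:
    """Lead combining class: ccc of the first code point of ch's NFD form."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0])

def may_skip_ccc(m: str, b: str) -> bool:
    """Condition based on ccc, as described at the top of this ticket."""
    ccc_b = unicodedata.combining(b)
    return ccc_b != 0 and ccc_b != unicodedata.combining(m)

def may_skip_lccc(m: str, b: str) -> bool:
    """Hypothetical variant using lccc instead of ccc (the proposed fix)."""
    return lccc(b) != 0 and lccc(b) != lccc(m)

m, b = "\u0334", "\u0F81"
print(may_skip_ccc(m, b))   # ccc(U+0F81) is 0, so B is not considered
print(may_skip_lccc(m, b))  # lccc(U+0F81) is 129, lccc(U+0334) is 1
```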


July 1, 2018, 9:28 AM
Trac Comment 7 by —2013-02-12T21:21:06.382Z

The following is the same as a comment on CldrBug:5667.

We might need to add U+0344 to this list.

Richard Wordingham provided an example today on the unicode list ("FCD and Collation") where U+0344 COMBINING GREEK DIALYTIKA TONOS, which is equivalent to <0308 0301> (both ccc=230), can also cause incorrect results despite FCD input.

Consider a tailoring with contractions 03B1+0308 and 0301+0345.

Assume a builder that adds further contractions to cover overlaps between contractions and decompositions. It would add 03B1+0344, 03B1+0344+0345 and 0344+0345.

The input string 03B1 0359 0344 0345 (with U+0359 COMBINING ASTERISK BELOW as an example for any character with ccc<230) is processed via discontiguous-contraction matching as 03B1+0344+0345, 0359

... but when processing the NFD form 03B1 0359 0308 0301 0345 we get 03B1+0308, 0359, 0301+0345 – note the different position of the 0359.
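
The two forms of the input can be inspected with Python's `unicodedata` (a sketch; `describe` is a helper invented for this example, and it shows only the code points and their ccc values, not the contraction matching itself).

```python
import unicodedata

def describe(text: str) -> list[tuple[str, int]]:
    """Pair each code point (as hex) with its canonical combining class."""
    return [(f"{ord(c):04X}", unicodedata.combining(c)) for c in text]

fcd_input = "\u03B1\u0359\u0344\u0345"
nfd_input = unicodedata.normalize("NFD", fcd_input)

print(describe(fcd_input))  # U+0344 is still a single code point here
print(describe(nfd_input))  # U+0344 becomes U+0308 U+0301, after the U+0359
```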

The full set of problematic characters (link to Mark's demo) appears to be `[\u0344\u0F73\u0F75\u0F81]`.

July 1, 2018, 9:28 AM
Trac Comment 8 by —2013-11-09T00:02:58.683Z

The "collv2" branch implementation decomposes the Tibetan composite vowels in the FCD check, even if the segment otherwise passes the check.
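
The effect can be approximated in Python as follows. This is only a sketch of the behavior described above, not the collv2 code; the names `TIBETAN_COMPOSITE_VOWELS` and `decompose_tibetan_composite_vowels` are invented here, and the function assumes it is applied to a segment that already passed the FCD check.

```python
import unicodedata

# The three Tibetan composite vowel signs with ccc == 0 and lccc != 0.
TIBETAN_COMPOSITE_VOWELS = {"\u0F73", "\u0F75", "\u0F81"}

def decompose_tibetan_composite_vowels(segment: str) -> str:
    """Replace each composite vowel with its NFD form; leave everything else alone."""
    return "".join(
        unicodedata.normalize("NFD", ch) if ch in TIBETAN_COMPOSITE_VOWELS else ch
        for ch in segment
    )

segment = "\u0FB2\u0334\u0F81"
fixed = decompose_tibetan_composite_vowels(segment)
print([f"{ord(c):04X}" for c in fixed])
```

After this step the trailing U+0F71 and U+0F80 have non-zero ccc, so the usual discontiguous-contraction condition can apply to them.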

I am back to thinking that U+0344 is different, and should be handled by adding overlap contractions as needed.



Markus Scherer

