Assume a contraction AB that maps to some collation element. Discontiguous contraction handling applies when a combining mark m occurs between A and B, e.g., AmB. It is done when ccc(B)!=0 and ccc(B)!=ccc(m).
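As a minimal sketch, reading the condition as ccc(B)!=0 and ccc(B)!=ccc(m), the test can be written with Python's standard `unicodedata` module. The function name `may_match_across` is hypothetical and is not taken from ICU.

```python
import unicodedata


def ccc(ch: str) -> int:
    """Canonical combining class of a single character."""
    return unicodedata.combining(ch)


def may_match_across(m: str, b: str) -> bool:
    """Hypothetical helper: may B be matched discontiguously across mark m?

    Mirrors the condition in the note: B must be a non-starter, and its
    combining class must differ from that of the intervening mark.
    """
    return ccc(b) != 0 and ccc(b) != ccc(m)


# U+0334 (ccc=1) between A and U+0301 (ccc=230): B can be considered.
print(may_match_across("\u0334", "\u0301"))
# U+0300 and U+0301 share ccc=230: B is not considered.
print(may_match_across("\u0300", "\u0301"))
```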
Mostly, this works. However, it collides with one of our optimizations: UCA is specified to work on NFD text, but when an ICU Collator gets FCD input it just uses it, with data that has undergone canonical closure.
*The problem* is that there are three Tibetan vowel signs (U+0F73, U+0F75, U+0F81) with ccc==0 whose decompositions have ccc!=0: `[[:ccc=0:]&…`
Consider UCA 6.0's CollationTest_*.txt files with lines like `0FB2 0334 0F81`: Such a line passes the FCD check, but U+0F81 has ccc=0, which prevents the discontiguous contraction of 0FB2+0F81 from even being considered, so we get the wrong result.
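The situation can be reproduced with a small FCD check. This is a sketch built on Python's `unicodedata`, not ICU's implementation; `lead_trail_ccc` and `passes_fcd` are names made up for illustration.

```python
import unicodedata


def lead_trail_ccc(ch: str) -> tuple[int, int]:
    """Return (lccc, tccc): the ccc of the first and last NFD character."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])


def passes_fcd(text: str) -> bool:
    """Simplified FCD check: each character's lccc must be 0 or must not
    be lower than the tccc of the preceding character."""
    prev_tccc = 0
    for ch in text:
        lccc, tccc = lead_trail_ccc(ch)
        if lccc != 0 and prev_tccc > lccc:
            return False
        prev_tccc = tccc
    return True


line = "\u0FB2\u0334\u0F81"
nfd = unicodedata.normalize("NFD", line)

print(passes_fcd(line))                        # the FCD check result
print(unicodedata.combining("\u0F81"))         # ccc of the composite vowel
print([unicodedata.combining(c) for c in nfd])  # ccc values in the NFD form
```

The composite vowel itself reports ccc=0, while its decomposition consists of non-starters only, which is why a matcher looking at ccc(B) never tries 0FB2+0F81.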
We might have a related problem when the intervening combining mark m is one of these three Tibetan vowels: We probably don't consider discontiguous contraction across one of these either.
A possible fix might check for the lccc value of m and B rather than their ccc values. (Not discussed or tested.)
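A sketch of what the lccc-based variant could look like, under the same caveat as the note: it has not been discussed or tested against ICU. Here lccc is taken as the ccc of the first character of the NFD form, and `may_match_across_lccc` is a hypothetical name.

```python
import unicodedata


def lccc(ch: str) -> int:
    """Lead combining class: ccc of the first character of NFD(ch)."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0])


def may_match_across_lccc(m: str, b: str) -> bool:
    """Hypothetical variant of the discontiguous-contraction condition
    that compares lccc values instead of ccc values."""
    return lccc(b) != 0 and lccc(b) != lccc(m)


# With ccc, U+0F81 looks like a starter; with lccc it does not.
print(unicodedata.combining("\u0F81"), lccc("\u0F81"))
# The test-file case: m = U+0334, B = U+0F81.
print(may_match_across_lccc("\u0334", "\u0F81"))
```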
The following is the same as a comment on CldrBug:5667.
We might need to add U+0344 to this list.
Richard Wordingham provided an example today on the unicode list ("FCD and Collation") where U+0344 COMBINING GREEK DIALYTIKA TONOS, which is equivalent to <0308 0301> (both ccc=230), can also cause incorrect results despite FCD input.
Consider a tailoring with contractions 03B1+0308 and 0301+0345.
Assume a builder that adds further contractions to cover overlaps between contractions and decompositions. It would add 03B1+0344, 03B1+0344+0345 and 0344+0345.
Input string: 03B1 0359 0344 0345 (with U+0359 COMBINING ASTERISK BELOW standing in for any character with ccc<230). Discontiguous-contraction matching processes it as 03B1+0344+0345, 0359
... but when processing the NFD form 03B1 0359 0308 0301 0345 we get 03B1+0308, 0359, 0301+0345. Note the different position of the 0359.
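The two segmentations can be reproduced with a toy matcher. This is only a sketch: it implements a simplified longest-match plus discontiguous step in the spirit of the UCA, uses the five contractions from the example as its entire data, and the names `CONTRACTIONS` and `segment` are invented for illustration.

```python
import unicodedata

ccc = unicodedata.combining

# The two tailored contractions plus the overlap contractions that the
# assumed builder adds.
CONTRACTIONS = {
    "\u03B1\u0308", "\u0301\u0345",
    "\u03B1\u0344", "\u03B1\u0344\u0345", "\u0344\u0345",
}


def segment(text: str) -> list[str]:
    """Split text into units: contractions (contiguous or discontiguous)
    and single characters."""
    chars = list(text)
    units = []
    while chars:
        # Longest contiguous prefix that is a contraction, else one character.
        n = max((k for k in range(2, len(chars) + 1)
                 if "".join(chars[:k]) in CONTRACTIONS), default=1)
        s = "".join(chars[:n])
        del chars[:n]
        # Discontiguous step: absorb following unblocked non-starters.
        i = 0
        max_skipped = 0  # highest ccc among the marks skipped so far
        while i < len(chars) and ccc(chars[i]) != 0:
            c = chars[i]
            blocked = max_skipped >= ccc(c)
            if not blocked and s + c in CONTRACTIONS:
                s += c
                del chars[i]
            else:
                max_skipped = max(max_skipped, ccc(c))
                i += 1
        units.append(s)
    return units


def show(units: list[str]) -> str:
    return ", ".join("+".join(f"{ord(c):04X}" for c in u) for u in units)


fcd_input = "\u03B1\u0359\u0344\u0345"
nfd_input = unicodedata.normalize("NFD", fcd_input)
print(show(segment(fcd_input)))
print(show(segment(nfd_input)))
```

In the first segmentation the 0359 ends up after the whole contraction; in the second it sits between the two contractions, so the two inputs cannot map to the same sequence of collation elements.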
The full set of problematic characters appears to be `[\u0344\u0F73\u0F75\u0F81]` (computed with Mark's demo).
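One way to look for such characters with Python's `unicodedata` is to scan for code points whose canonical decomposition has more than one character and begins with a non-starter. This criterion is an assumption chosen for illustration; it is not necessarily the set expression used in the demo, and the result depends on the Unicode version bundled with the Python build.

```python
import sys
import unicodedata


def non_starter_decompositions() -> list[int]:
    """Code points whose NFD form is longer than one character and starts
    with a character that has ccc != 0."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        if not decomp or decomp.startswith("<"):
            continue  # no canonical decomposition (none, or compatibility only)
        nfd = unicodedata.normalize("NFD", ch)
        if len(nfd) > 1 and unicodedata.combining(nfd[0]) != 0:
            found.append(cp)
    return found


print([f"U+{cp:04X}" for cp in non_starter_decompositions()])
```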
The "collv2" branch implementation decomposes the Tibetan composite vowels in the FCD check, even if the segment otherwise passes the check.
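The idea can be illustrated as a preprocessing step that replaces only the three composite vowels with their canonical decompositions and leaves everything else alone. This is a sketch of the concept, not the collv2 code; `TIBETAN_COMPOSITE_VOWELS` and `decompose_tibetan_composites` are hypothetical names.

```python
import unicodedata

# Map each composite vowel to its canonical decomposition, taken from the
# Unicode data shipped with Python rather than hard-coded.
TIBETAN_COMPOSITE_VOWELS = {
    cp: unicodedata.normalize("NFD", chr(cp))
    for cp in (0x0F73, 0x0F75, 0x0F81)
}


def decompose_tibetan_composites(text: str) -> str:
    """Replace U+0F73, U+0F75 and U+0F81 with their decompositions."""
    return text.translate(TIBETAN_COMPOSITE_VOWELS)


line = "\u0FB2\u0334\u0F81"
fixed = decompose_tibetan_composites(line)
print(" ".join(f"{ord(c):04X}" for c in fixed))
```

After this step the last character of the test line is U+0F80, a non-starter, so the usual ccc-based condition no longer rules out a discontiguous match across the U+0334.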
I am back to thinking that U+0344 is different, and should be handled by adding overlap contractions as needed.