See also CLDR defect 1514.
Here's a note from Mark Davis on 2007-09-04 giving more details. I tend to favour his proposed solution 2.
ICU could do the right thing by fully normalizing, but at a definite performance hit for any affected language like Vietnamese. By adding some extra information, we concluded that we could both do the right thing and keep the performance up. So that's why Vladimir filed the bug on CLDR, to add that extra data.
Let me recount the issue. The desired ordering is:
a << a-dot < a-hat << a-dot-hat
The weights would be:
WA << WA+WD < WAH << WAH+WD
where WA means Weight of A, WAH means Weight of A-hat (a primary difference), and WD means Weight of Dot (a primary ignorable)
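The weight relations above can be sketched with toy sort keys (hypothetical numeric weights, not real UCA collation elements; a sort key here is just a pair of primary and secondary weight sequences):

```python
# Toy weights: WA < WAH at the primary level; WD only matters at the
# secondary level (the dot is primary-ignorable).
WA, WAH = 1, 2
WD = 1

def sort_key(primaries, secondaries):
    # Primary level decides first; secondaries only break primary ties.
    return (tuple(primaries), tuple(secondaries))

a         = sort_key([WA],  [0])    # WA
a_dot     = sort_key([WA],  [WD])   # WA + WD
a_hat     = sort_key([WAH], [0])    # WAH
a_dot_hat = sort_key([WAH], [WD])   # WAH + WD

# desired ordering: a << a-dot < a-hat << a-dot-hat
assert a < a_dot < a_hat < a_dot_hat
```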
That is, the dot is a secondary difference and the hat is a primary (letter) difference. Let's expand this out by adding all of the following equivalences. I mark the cases where we transform to FCD also, and show the desired weights after == for the first item in each equivalency group.
A == WA
A-dot == WA+WD
A-hat == WAH
A + hat
A + dot + hat == WAH + WD
A + hat + dot => A + dot + hat (FCD)
A-dot + hat
A-hat + dot => A + dot + hat (FCD)
Because of FCD, the cases in the last group devolve to three:
1. A + dot + hat
2. A-dot + hat
3. A-dot-hat
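These equivalence groups can be checked with Python's standard `unicodedata` module (a sketch; ICU itself stays in FCD rather than fully normalizing, which is exactly the source of the problem):

```python
import unicodedata

# All the canonically equivalent spellings of a-with-dot-and-hat normalize
# to the same NFD form: a + dot-below (ccc 220) + circumflex (ccc 230).
forms = [
    "\u1EAD",          # A-dot-hat, precomposed
    "a\u0323\u0302",   # A + dot + hat
    "a\u0302\u0323",   # A + hat + dot
    "\u1EA1\u0302",    # A-dot + hat
    "\u00E2\u0323",    # A-hat + dot
]
nfd = {unicodedata.normalize("NFD", s) for s in forms}
assert nfd == {"a\u0323\u0302"}   # a single equivalence class
```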
At build time, we build data for the canonical equivalents of all the characters that are tailored. So we build a table for
A + hat
but none of the other cases. Now, in processing, we handle #1 correctly: a discontiguous contraction joins the A and the hat into A-hat, giving WAH, and the weight for the dot follows. But the other two cases don't get touched, so we get a difference that shouldn't exist.
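The FCD condition driving this can be sketched in Python (an illustrative checker built on stdlib `unicodedata`; ICU's real implementation uses precomputed lead/trail combining-class data): a string is FCD if decomposing each character in place, without reordering, already yields non-decreasing combining classes.

```python
import unicodedata

def is_fcd(s: str) -> bool:
    # Track the combining class of the last character of each
    # character's full decomposition; a drop to a smaller nonzero
    # class at the next character's lead breaks FCD.
    prev_trail = 0
    for ch in s:
        d = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(d[0])
        if lead != 0 and lead < prev_trail:
            return False
        prev_trail = unicodedata.combining(d[-1])
    return True

# The problem cases stay FCD, so the runtime never decomposes them:
assert is_fcd("a\u0323\u0302")   # A + dot + hat
assert is_fcd("\u1EA1\u0302")    # A-dot + hat
assert is_fcd("\u1EAD")          # precomposed form, trivially FCD
# A-hat + dot is NOT FCD, so the runtime forces a decomposition:
assert not is_fcd("\u00E2\u0323")
```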
Here are some of the alternatives that would solve the problem for Vietnamese.
1. Have a flag to fully decompose (NFD, not just FCD), and use it to fully decompose Vietnamese. Expensive.
2. At build time, when we see a contraction of X + combining mark, mark all of the characters containing the base letter of X as needing full decomposition. Thus when we hit a-dot, or a-ring, or anything containing 'a' or 'A', AND that character is followed by a combining mark, we would decompose fully. Faster, since we don't always decompose, but it requires lookahead plus new code for a different kind of operation.
3. Add all the characters that are significant to Vietnamese to the set that gets pre-built table entries. At that point, we would build and cache all the canonically equivalent cases. Thus A-dot-hat would have a pre-built entry (WAH + WD) – and by equivalence, so would A-dot + hat, which would be a contraction that also produced WAH + WD. Much cheaper, with little (though not zero) new code. It is not completely general, since it wouldn't handle, say, a + bar_below + hat correctly unless we knew when the rules were built that this combination was important.
(The bug doesn't need the CLDR data for fixing in the short term – that can be done wholly in ICU as a special case. But for the longer term, we should add to CLDR.)
More info from Mark Davis on 2007-09-05:
Vladimir and I gnawed at this a bit more, and came up with what we think is the optimal solution. The problem is that with certain precomposed characters, we are not allowing the discontiguous contractions to work, since we don't see the base character if we leave it in FCD. (If it is not in FCD, it's ok, since we force a decomposition in that case.) So here is option 4.
4. When we add a contraction with base + combining (example: a + hat), we find all precomposed characters C that meet the following conditions:
C's canonical decomposition contains the base (example: a)
The combining mark is not blocked. (example: we exclude a-umlaut, because umlaut blocks hat)
The sequence C + combining is FCD. (example: if our original contraction were a + dot, we would exclude C = a-hat, because a-hat + dot is not FCD and would be decomposed at runtime anyway)
We then prebuild contractions for each C + combining (here, each C + hat), and add them to the table.
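The build-time search in option 4 can be sketched as follows (a hypothetical helper using stdlib `unicodedata`, not ICU's actual build code; it scans only a small code-point range for brevity, where real build code would cover all assigned characters):

```python
import unicodedata

def trail_ccc(ch):
    # Combining class of the last character of ch's full decomposition.
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[-1])

def precomposed_candidates(base, mark, limit=0x3000):
    # Find precomposed C for which the contraction base + mark should get
    # a prebuilt entry, per the three conditions above.
    mark_ccc = unicodedata.combining(mark)
    out = []
    for cp in range(limit):
        c = chr(cp)
        d = unicodedata.normalize("NFD", c)
        if len(d) < 2 or d[0] != base:
            continue                 # decomposition must contain the base
        if any(unicodedata.combining(m) == mark_ccc for m in d[1:]):
            continue                 # the mark is blocked inside C
        if trail_ccc(c) > mark_ccc:
            continue                 # C + mark would not be FCD
        out.append(c)
    return out

cands = precomposed_candidates("a", "\u0302")   # contraction: a + hat
assert "\u1EA1" in cands       # a-dot-below: dot (ccc 220) does not block
assert "\u00E4" not in cands   # a-umlaut: umlaut (ccc 230) blocks the hat
assert "\u00E2" not in cands   # a-hat itself: hat blocks hat
```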
We think this will solve all the general cases of canonical equivalence, with no change to runtime code, some manageable new build code, and a small cost to performance.
A section of the original message was not readable. Here it is with curly braces:
Fixed the code in ucol_elm.cpp and added test cases for this issue.
Please review, thanks!
The problems fixed in this trac ticket include:
1. U+1EAC did not compare equal to its canonical equivalent, U+1EA0 U+0302.
2. If a tailoring rule contains an accented character at the end of a contraction,
FCD handling is broken. For example, given the rule:
& a < a\u00EA (a + e with circumflex)
the string a\u1EC7 (a + e with dot below and circumflex) got the same primary weights as "ae", not "a\u00EA".
3. The test function hasCollationElements() always returned false.
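Bug 1 above is a canonical-equivalence failure; the equivalence itself is easy to confirm with stdlib `unicodedata`:

```python
import unicodedata

# U+1EAC (A with circumflex and dot below) and U+1EA0 U+0302 (A with dot
# below + combining circumflex) share the same NFD form, so a conformant
# collator must compare them equal.
assert (unicodedata.normalize("NFD", "\u1EAC")
        == unicodedata.normalize("NFD", "\u1EA0\u0302")
        == "A\u0323\u0302")
```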