For the two canonically equivalent strings U+1EAC and U+1EA0 U+0302, the sort
keys generated by the ro (Romanian) and vi (Vietnamese) collators are not
identical, even though normalization is turned on. According to Vladimir
Weinstein's email dated 2007-08-29:
The problem is that the half-composed form U+1EA0 U+0302 does not trigger the
"A starts a contraction" rule and thus doesn't do the discontiguous contraction
that matches A + U+0302.
From the following set of strings:
"\u1EA0\u0302",
"\u1EAC",
"\u0041\u0323\u0302",
"\u00c2\u0323",
"\u0041\u0302\u0323"
only U+1EA0 U+0302 produces a different result (when normalization is turned on,
of course).
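As a quick check that the five strings above really are canonically equivalent,
the following minimal sketch normalizes each of them with Python's standard
unicodedata module. It does not call the ICU collator and does not reproduce
the sort-key difference; it only confirms that all five share one NFD form and
one NFC form, so a collator running with normalization on would be expected to
give them identical keys. The names STRINGS and canonical_forms are illustrative
and not part of the original report.

```python
import unicodedata

# The five spellings listed in the report.
STRINGS = [
    "\u1EA0\u0302",        # A with dot below (precomposed) + combining circumflex
    "\u1EAC",              # fully precomposed
    "\u0041\u0323\u0302",  # A + dot below + circumflex (fully decomposed)
    "\u00c2\u0323",        # A with circumflex (precomposed) + combining dot below
    "\u0041\u0302\u0323",  # A + circumflex + dot below (marks in non-canonical order)
]


def canonical_forms(strings, form="NFD"):
    """Return the set of distinct normalized forms of the given strings."""
    return {unicodedata.normalize(form, s) for s in strings}


if __name__ == "__main__":
    for form in ("NFD", "NFC"):
        forms = canonical_forms(STRINGS, form)
        # A single element means every input normalizes to the same string.
        print(form, [" ".join(f"U+{ord(c):04X}" for c in f) for f in forms])
```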
There are several ways to fix this.
The immediate fix is to add more rules to the Vietnamese collation that would
handle the different positioning of combining marks.
In the long term, locales such as Vietnamese would include a repertoire set
that tells us what kinds of letter/mark combinations we can expect. This could
then be used to generate additional data that would fix this problem. This
information could go into CLDR.
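To illustrate how a repertoire entry might be expanded into the spellings that
extra data would have to cover, here is a rough sketch using only Python's
unicodedata module. It assumes each repertoire entry is a single precomposed
character consisting of one base letter plus combining marks; the function name
equivalent_spellings is hypothetical, and this is not how ICU or CLDR generate
collation data.

```python
import itertools
import unicodedata


def equivalent_spellings(ch):
    """Enumerate canonically equivalent spellings of ch.

    Assumes the NFD form of ch is one base letter followed by combining marks.
    Each permutation of the marks is tried, with the base composed with the
    first k marks and the remaining marks left as separate code points.
    """
    nfd = unicodedata.normalize("NFD", ch)
    base, marks = nfd[0], nfd[1:]
    target = unicodedata.normalize("NFC", ch)
    spellings = set()
    for perm in itertools.permutations(marks):
        for k in range(len(perm) + 1):
            head = unicodedata.normalize("NFC", base + "".join(perm[:k]))
            candidate = head + "".join(perm[k:])
            # Keep only candidates that are still canonically equivalent;
            # this rejects reorderings of marks with the same combining class.
            if unicodedata.normalize("NFC", candidate) == target:
                spellings.add(candidate)
    return spellings


if __name__ == "__main__":
    for s in sorted(equivalent_spellings("\u1EAC")):
        print(" ".join(f"U+{ord(c):04X}" for c in s))
```

For U+1EAC this yields exactly the five strings listed earlier, including the
half-composed form U+1EA0 U+0302 that triggers the problem.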