LDML collation normalization=off cannot sort all FCD strings correctly

Description

Deleted Component: xxx-spec

The LDML spec's Collation Settings table says of the kk=normalization attribute: "If off, then all strings that are in [FCD] will sort correctly."

This is not quite true. If a string (FCD or not) contains one of the Tibetan precomposed vowels (U+0F73, U+0F75, or U+0F81), the precomposed vowel must be decomposed, or the string might not sort correctly. The problem is that any contraction involving the second part of the vowel's decomposition needs to skip the first part (discontiguous contraction matching, UCA algorithm steps S2.1.1-S2.1.3). The DUCET itself has such contractions: the decompositions of the precomposed vowels themselves.

Suggestion: Change the normalization attribute spec to say "If off, then all strings that are in [FCD] and do not contain U+0F73, U+0F75, or U+0F81 will sort correctly."
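To make the proposed condition concrete, here is a minimal Python sketch of an FCD check plus the extra exclusion. It assumes Python's standard `unicodedata` module; the function names `is_fcd` and `meets_proposed_condition` are made up for illustration and are not part of LDML, CLDR, or ICU.

```python
import unicodedata

# The three Tibetan precomposed vowels named in the suggestion above.
EXCLUDED = {"\u0F73", "\u0F75", "\u0F81"}

def is_fcd(s):
    """Return True if s passes the FCD check: for each character, the
    leading ccc of its NFD form is 0 or is not lower than the trailing
    ccc of the previous character's NFD form."""
    prev_tccc = 0
    for ch in s:
        nfd = unicodedata.normalize("NFD", ch)
        lccc = unicodedata.combining(nfd[0])   # leading combining class
        tccc = unicodedata.combining(nfd[-1])  # trailing combining class
        if lccc != 0 and prev_tccc > lccc:
            return False
        prev_tccc = tccc
    return True

def meets_proposed_condition(s):
    """Hypothetical helper: FCD and free of the excluded characters."""
    return is_fcd(s) and not any(ch in EXCLUDED for ch in s)

# TIBETAN LETTER KA + precomposed VOWEL SIGN II: FCD, yet excluded.
print(is_fcd("\u0F40\u0F73"), meets_proposed_condition("\u0F40\u0F73"))
# The same vowel in decomposed form: FCD and allowed.
print(meets_proposed_condition("\u0F40\u0F71\u0F72"))
```

The point of the example is only that passing the FCD check does not rule out these characters, which is why the suggested wording names them separately.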

xpath

None

locale

None

Activity

TracBot
May 10, 2019, 3:37 AM
Trac Comment 1 by —2013-02-12T21:20:51.868Z

We might need to add U+0344 to this list.

Richard Wordingham provided an example today on the Unicode list ("FCD and Collation") where U+0344 COMBINING GREEK DIALYTIKA TONOS, which is equivalent to <0308 0301> (both ccc=230), can also cause incorrect results despite FCD input.

Consider a tailoring with contractions 03B1+0308 and 0301+0345.

Assume a builder that adds further contractions to cover overlaps between contractions and decompositions. It would add 03B1+0344, 03B1+0344+0345 and 0344+0345.

Input string: 03B1 0359 0344 0345 (with U+0359 COMBINING ASTERISK BELOW as an example of any character with ccc<230) is processed via discontiguous-contraction matching as 03B1+0344+0345, 0359 ...

... but when processing the NFD form 03B1 0359 0308 0301 0345, we get 03B1+0308, 0359, 0301+0345. Note the different position of the 0359.
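The two segmentations above can be reproduced with a small Python sketch. This is a simplified model of UCA steps S2.1 to S2.1.3, not ICU's or any real builder's implementation: the contraction table is a plain set of literal strings, and the names `segment` and `CONTRACTIONS` are hypothetical.

```python
import unicodedata

# The two tailored contractions plus the three the assumed builder adds.
CONTRACTIONS = {
    "\u03B1\u0308", "\u0301\u0345",
    "\u03B1\u0344", "\u03B1\u0344\u0345", "\u0344\u0345",
}

def segment(text, contractions):
    """Split text into matched units using simplified S2.1-S2.1.3."""
    chars = list(text)
    out = []
    while chars:
        # S2.1: longest contiguous prefix in the table, else one character.
        n = 1
        for k in range(len(chars), 1, -1):
            if "".join(chars[:k]) in contractions:
                n = k
                break
        s, rest = "".join(chars[:n]), chars[n:]
        # S2.1.1-S2.1.3: try to absorb following unblocked non-starters.
        max_skipped_ccc = 0
        i = 0
        while i < len(rest):
            ccc = unicodedata.combining(rest[i])
            if ccc == 0:
                break  # a starter ends the run of non-starters
            if ccc > max_skipped_ccc and s + rest[i] in contractions:
                s += rest[i]
                del rest[i]  # matched: remove it from the remaining text
            else:
                max_skipped_ccc = max(max_skipped_ccc, ccc)
                i += 1
        out.append(s)
        chars = rest
    return out

def show(units):
    return ", ".join("+".join(f"{ord(c):04X}" for c in u) for u in units)

fcd_input = "\u03B1\u0359\u0344\u0345"
nfd_input = unicodedata.normalize("NFD", fcd_input)
print(show(segment(fcd_input, CONTRACTIONS)))  # 03B1+0344+0345, 0359
print(show(segment(nfd_input, CONTRACTIONS)))  # 03B1+0308, 0359, 0301+0345
```

In this model the 0359 ends up after the whole contraction for the FCD input but between the two contractions for the NFD input, which is the discrepancy described above.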

The full set of problematic characters appears to be ` & ` (link to Mark's demo) == `[\u0344\u0F73\u0F75\u0F81]`.
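As a cross-check on the resulting set, the following Python sketch lists every code point whose canonical decomposition has more than one character and starts with a non-starter. That criterion is an assumption about what characterizes the problematic characters; it is not necessarily the set expression used in the demo. The function name `non_starter_decompositions` is made up, and the result depends on the Unicode version bundled with the Python runtime.

```python
import unicodedata

def non_starter_decompositions():
    """Code points whose NFD form is longer than one character and
    begins with a character of non-zero combining class."""
    found = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        nfd = unicodedata.normalize("NFD", chr(cp))
        if len(nfd) > 1 and unicodedata.combining(nfd[0]) != 0:
            found.append(cp)
    return found

problematic = non_starter_decompositions()
print(" ".join(f"U+{cp:04X}" for cp in problematic))
```

With current Unicode data this should list U+0344, U+0F73, U+0F75 and U+0F81, matching the set above.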

(I am copying this text into as well.)

Priority

medium

Assignee

Markus Scherer

Reporter

Markus Scherer

Reviewer

Mark Davis

Labels

Components

None

Fix versions

Phase

None