Fixed
Details
Priority: blocks-release
Assignee: Mark Davis
Reporter: Richard Gibson
Reviewer: Annemarie Apple 🍎
Created August 16, 2023 at 10:17 PM
Updated January 30, 2025 at 3:23 PM
Resolved January 30, 2025 at 3:23 PM
The intent of “Canonicalizing Syntax”: “Preprocessing” https://unicode.org/reports/tr35/#preprocessing step 5 appears to be a multidimensional sort of rules by
total aggregate cardinality of field value sets, resolving ties by
lexicographic ordering by non-emptiness of [Language, Script, Region, Variants] field value sets (as would be achieved by numerically sorting a 4-bit representation of each rule in which each bit corresponds with absence/emptiness of one of those fields in that order, e.g.
(isEmpty(rule.L) << 3) | (isEmpty(rule.S) << 2) | (isEmpty(rule.R) << 1) | (isEmpty(rule.V))
sorting {L={zh}, S={Hant}, R={CN}} < {L={zh}, S={Hans}, V={pinyin}} < {L={en}, R={GB}, V={scouse}} < {V={fonipa,hepburn,heploc}}), resolving ties by
case-insensitive ASCIIbetical (i.e., digits before letters) lexicographic ordering by field value set elements for each of [Language, Script, Region, Variants] in that order (e.g., {L={ja}, V={hepburn,heploc}} < {L={zh}, V={1996,pinyin}} < {L={zh}, V={hepburn,heploc}}).
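The 4-bit representation described above can be sketched in Python as follows. This is only an illustration of my reading, not text from the spec: the dict-of-lists rule shape and the `emptiness_mask` name are assumptions made for the example.

```python
FIELDS = ("L", "S", "R", "V")  # Language, Script, Region, Variants


def emptiness_mask(rule):
    """Return a 4-bit integer with one bit set per absent/empty field.

    L maps to the highest bit and V to the lowest, mirroring
    (isEmpty(L) << 3) | (isEmpty(S) << 2) | (isEmpty(R) << 1) | isEmpty(V).
    """
    mask = 0
    for field in FIELDS:
        mask = (mask << 1) | (0 if rule.get(field) else 1)
    return mask


# Hypothetical rule shape: each field maps to a list of subtags.
rules = [
    {"V": ["fonipa", "hepburn", "heploc"]},
    {"L": ["en"], "R": ["GB"], "V": ["scouse"]},
    {"L": ["zh"], "S": ["Hans"], "V": ["pinyin"]},
    {"L": ["zh"], "S": ["Hant"], "R": ["CN"]},
]

# Sorting ascending puts rules with earlier non-empty fields first.
for rule in sorted(rules, key=emptiness_mask):
    print(format(emptiness_mask(rule), "04b"), rule)
```

Sorting by the mask alone only models the tie-break among rules with equal total cardinality; it is not the full ordering.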
However, the explanation has issues:
In steps 5.1 and 5.2, “And then order by” prefixes can be interpreted as denoting a single independent chronological step applying to the entire collection of rules rather than to each subcollection of ties from the previous step. Similarly, “After this point” in step 5.2 does not have a clear interpretation.
In step 5.2, “order by field” is not explicit about comparing only whether each field value set is non-empty vs. absent/empty, with non-empty sets ordered before absent/empty sets. Similarly, neither it nor step 5.3 is explicit about considering the next field only to resolve ties.
Also in step 5.2, the second bullet point appears to be comparing equal multimaps (i.e., both specify only Variants and have two elements in the corresponding set of values: “hepburn” and “heploc”).
The bullet points for steps 5.1, 5.2, and 5.3 are not indented to clarify their scope.
The i, ii, and iii columns of the example table require a lot of inference to understand that data in their cells explains why any given row is ordered after its predecessor.
I’m not even certain that my understanding is correct, but if it is then the explanation could use some improvement. For example:
Order the set of rules using the following comparison logic:
For each rule, count the number of items in each field value set (L, S, R, V) and sum the four counts. If two rules have differing sums, order the rule with the greater sum before the rule with the smaller sum.
For example, {V={hepburn,heploc}} is tied with {L={en}, R={GB}} (because both have 2 total field value items) and both precede {R={CA}} (which has 1).
For rule pairs that are not differentiated by the previous step, consider the value set for each field in the order L, then S, then R, then V. If one rule has a non-empty value set for that field and the other rule does not, then order the rule with the non-empty value set for that field before the other rule and disregard all later fields. Otherwise, consider the next field.
For example, {L={zh}, S={Hant}, R={CN}} is tied with {L={en}, S={Latn}, R={GB}} (because both have non-empty sets for L, S, and R but not for V), and both precede {L={zh}, S={Hans}, V={pinyin}} (because it lacks values for R), which precedes {L={en}, R={GB}, V={scouse}} (because it lacks values for S), which precedes {V={fonipa,hepburn,heploc}} (because it lacks values for L), which is tied with {V={hepburn,heploc,simple}} (because both have non-empty sets for V but not for L, S, or R).
For rule pairs that are not differentiated by the previous step, consider the value set for each field in the order L, then S, then R, then V as a sequence of subtags. If those sequences for the same field of two rules differ, then consider the first position of difference in the two sequences and order the rules by alphanumeric comparison (in which digits 0 through 9 sort in that order before all letters) of the field value at that position and disregard all later fields. Otherwise, consider the next field.
For example, {L={ja}, V={hepburn,heploc}} precedes {L={zh}, V={1996,pinyin}} (because it has a different field value set for L and "ja" precedes "zh" at the first position of difference), which precedes {L={zh}, V={hepburn,heploc}} (because it has the same field value set for L and a different field value set for V in which "1996" precedes "hepburn" at the first position of difference), which precedes {L={zh}, V={hepburn,simple}} (because it has the same field value set for L and a different field value set for V in which "heploc" precedes "simple" at the first position of difference).
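The three comparison steps proposed above can be collapsed into a single sort key. The following is a minimal sketch under stated assumptions: rules are dicts of subtag lists (a hypothetical shape), each list is already in the order it should be compared in, and `sort_key` is an illustrative name. Where one sequence is a prefix of another, Python orders the shorter one first; the proposed wording does not cover that case, so treat it as an assumption.

```python
FIELDS = ("L", "S", "R", "V")  # Language, Script, Region, Variants


def sort_key(rule):
    # Lowercase every subtag so the comparison is case-insensitive.
    values = [tuple(v.lower() for v in rule.get(f, ())) for f in FIELDS]
    total = sum(len(v) for v in values)
    return (
        -total,                        # step 1: greater total count first
        tuple(not v for v in values),  # step 2: non-empty (False) before empty (True)
        tuple(values),                 # step 3: digits sort before letters in ASCII
    )


rules = [
    {"L": ["zh"], "V": ["hepburn", "simple"]},
    {"R": ["CA"]},
    {"L": ["zh"], "V": ["1996", "pinyin"]},
    {"V": ["hepburn", "heploc"]},
    {"L": ["ja"], "V": ["hepburn", "heploc"]},
    {"L": ["en"], "R": ["GB"]},
    {"L": ["zh"], "V": ["hepburn", "heploc"]},
]

for rule in sorted(rules, key=sort_key):
    print(rule)
```

Tuple comparison in Python is itself lexicographic, which is why each step only matters when all earlier steps tie.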
So using the examples above, we get the following order (where the cell in a 5.2 or 5.3 column compares a row to its predecessor, and a blank cell indicates differentiation by an earlier step):
| languageId | 5.1 total field value set item count | 5.2 non-empty field value set | 5.3 field value set items |
| {L={en}, S={Latn}, R={GB}} | 3 | n/a | n/a |
| {L={zh}, S={Hant}, R={CN}} | 3 | match (L, S, R) | in L, “en” before “zh” |
| {L={zh}, S={Hans}, V={pinyin}} | 3 | (L, S, R, …) before (L, S, V) | |
| {L={en}, R={GB}, V={scouse}} | 3 | (L, S, …) before (L, R, …) | |
| {L={ja}, V={hepburn,heploc}} | 3 | (L, R, …) before (L, V) | |
| {L={zh}, V={1996,pinyin}} | 3 | match (L, V) | in L, “ja” before “zh” |
| {L={zh}, V={hepburn,heploc}} | 3 | match (L, V) | in V, “1996” before “hepburn” |
| {L={zh}, V={hepburn,simple}} | 3 | match (L, V) | in V, “heploc” before “simple” |
| {V={fonipa,hepburn,heploc}} | 3 | (L, …) before (V) | |
| {V={hepburn,heploc,simple}} | 3 | match (V) | in V, “fonipa” before “hepburn” |
| {L={en}, R={GB}} | 2 | | |
| {V={hepburn,heploc}} | 2 | (L, …) before (V) | |
| {R={CA}} | 1 | | |
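As a cross-check of the example ordering, the sketch below reports which step first separates each rule from its predecessor. It assumes the same hypothetical dict-of-lists rule shape as my reading above; `parts` and `deciding_step` are illustrative names, not anything defined by the spec.

```python
FIELDS = ("L", "S", "R", "V")  # Language, Script, Region, Variants


def parts(rule):
    """Return the three comparison components, one per step."""
    values = [tuple(v.lower() for v in rule.get(f, ())) for f in FIELDS]
    return (
        -sum(len(v) for v in values),  # 5.1: total item count, greater first
        tuple(not v for v in values),  # 5.2: non-empty before empty
        tuple(values),                 # 5.3: subtag-by-subtag comparison
    )


def deciding_step(prev, cur):
    """Name the first step at which two rules differ, or "tie"."""
    for label, a, b in zip(("5.1", "5.2", "5.3"), parts(prev), parts(cur)):
        if a != b:
            return label
    return "tie"


EXAMPLE_ORDER = [
    {"L": ["en"], "S": ["Latn"], "R": ["GB"]},
    {"L": ["zh"], "S": ["Hant"], "R": ["CN"]},
    {"L": ["zh"], "S": ["Hans"], "V": ["pinyin"]},
    {"L": ["en"], "R": ["GB"], "V": ["scouse"]},
    {"L": ["ja"], "V": ["hepburn", "heploc"]},
    {"L": ["zh"], "V": ["1996", "pinyin"]},
    {"L": ["zh"], "V": ["hepburn", "heploc"]},
    {"L": ["zh"], "V": ["hepburn", "simple"]},
    {"V": ["fonipa", "hepburn", "heploc"]},
    {"V": ["hepburn", "heploc", "simple"]},
    {"L": ["en"], "R": ["GB"]},
    {"V": ["hepburn", "heploc"]},
    {"R": ["CA"]},
]

for prev, cur in zip(EXAMPLE_ORDER, EXAMPLE_ORDER[1:]):
    print(deciding_step(prev, cur), cur)
```

If my understanding is correct, sorting any permutation of `EXAMPLE_ORDER` with `key=parts` should reproduce the order listed above.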