UTS #35 Annex C "LocaleId Canonicalization" Preprocessing algorithm seems underspecified

Description

The intent of step 5 of “Canonicalizing Syntax”: “Preprocessing” (https://unicode.org/reports/tr35/#preprocessing) appears to be a multidimensional sort of rules by

  1. total aggregate cardinality of field value sets (greatest first), resolving ties by

  2. lexicographic ordering by non-emptiness of the [Language, Script, Region, Variants] field value sets, with non-empty before empty (as would be achieved by numerically sorting a 4-bit representation of each rule in which each bit corresponds with absence/emptiness of one of those fields in that order, e.g. (isEmpty(rule.L) << 3) | (isEmpty(rule.S) << 2) | (isEmpty(rule.R) << 1) | isEmpty(rule.V), sorting {L={zh}, S={Hant}, R={CN}} < {L={zh}, S={Hans}, V={pinyin}} < {L={en}, R={GB}, V={scouse}} < {V={fonipa,hepburn,heploc}}), resolving ties by

  3. case-insensitive ASCIIbetical (i.e., digits before letters) lexicographic ordering by field value set elements for each of [Language, Script, Region, Variants] in that order (e.g., {L={ja}, V={hepburn,heploc}} < {L={zh}, V={1996,pinyin}} < {L={zh}, V={hepburn,heploc}}).
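To illustrate the 4-bit representation mentioned in item 2, here is a minimal Python sketch. It is only my reading of the step, not text from the spec: the dict-of-sets rule representation and the names FIELDS and emptiness_bits are hypothetical.

```python
# Hypothetical representation: a rule is a dict mapping a field name
# ("L", "S", "R", "V") to a set of subtags; a missing key means an empty set.
FIELDS = ("L", "S", "R", "V")

def emptiness_bits(rule):
    """Return a 4-bit integer with one bit per field, set when that field is empty.

    L is the most significant bit and V the least significant, so a numeric
    sort puts rules with a non-empty earlier field first.
    """
    bits = 0
    for field in FIELDS:
        bits = (bits << 1) | (0 if rule.get(field) else 1)
    return bits

rules = [
    {"V": {"fonipa", "hepburn", "heploc"}},
    {"L": {"en"}, "R": {"GB"}, "V": {"scouse"}},
    {"L": {"zh"}, "S": {"Hans"}, "V": {"pinyin"}},
    {"L": {"zh"}, "S": {"Hant"}, "R": {"CN"}},
]
ordered = sorted(rules, key=emptiness_bits)
```

With these four rules the keys are 0b0001, 0b0010, 0b0100, and 0b1110, which gives the same order as the example in item 2.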

 

However, the explanation has issues:

  1. In steps 5.1 and 5.2, “And then order by” prefixes can be interpreted as denoting a single independent chronological step applying to the entire collection of rules rather than to each subcollection of ties from the previous step. Similarly, “After this point” in step 5.2 does not have a clear interpretation.

  2. In step 5.2, “order by field” is not explicit that the comparison is binary (non-empty field value set vs. absent/empty), with non-empty sets ordered before absent/empty sets. Similarly, neither it nor step 5.3 is explicit about considering the next field only to resolve ties.

  3. Also in step 5.2, the second bullet point appears to be comparing equal multimaps (i.e., both specify only Variants and have two elements in the corresponding set of values: “hepburn” and “heploc”).

  4. The bullet points for steps 5.1, 5.2, and 5.3 are not indented to clarify their scope.

  5. The i, ii, and iii columns of the example table require a lot of inference to understand that the data in each cell explains why that row is ordered after its predecessor.

 

 

I’m not even certain that my understanding is correct, but if it is then the explanation could use some improvement. For example:


Order the set of rules using the following comparison logic:

  1. For each rule, count the number of items in each field value set (L, S, R, V) and sum the four counts. If two rules have differing sums, order the rule with the greater sum before the rule with the smaller sum.

    1. For example, {V={hepburn,heploc}} is tied with {L={en}, R={GB}} (because both have 2 total field value items) and both precede {R={CA}} (which has 1).

  2. For rule pairs that are not differentiated by the previous step, consider the value set for each field in the order L, then S, then R, then V. If one rule has a non-empty value set for that field and the other rule does not, then order the rule with the non-empty value set for that field before the other rule and disregard all later fields. Otherwise, consider the next field.

    1. For example, {L={zh}, S={Hant}, R={CN}} is tied with {L={en}, S={Latn}, R={GB}} (because both have non-empty sets for L, S, and R but not for V), and both precede {L={zh}, S={Hans}, V={pinyin}} (because it lacks values for R), which precedes {L={en}, R={GB}, V={scouse}} (because it lacks values for S), which precedes {V={fonipa,hepburn,heploc}} (because it lacks values for L), which is tied with {V={hepburn,heploc,simple}} (because both have non-empty sets for V but not for L, S, or R).

  3. For rule pairs that are not differentiated by the previous step, consider the value set for each field in the order L, then S, then R, then V as a sequence of subtags. If those sequences for the same field of two rules differ, then consider the first position of difference in the two sequences and order the rules by alphanumeric comparison (in which digits 0 through 9 sort in that order before all letters) of the field value at that position and disregard all later fields. Otherwise, consider the next field.

    1. For example, {L={ja}, V={hepburn,heploc}} precedes {L={zh}, V={1996,pinyin}} (because it has a different field value set for L and "ja" precedes "zh" at the first position of difference), which precedes {L={zh}, V={hepburn,heploc}} (because it has the same field value set for L and a different field value set for V in which "1996" precedes "hepburn" at the first position of difference), which precedes {L={zh}, V={hepburn,simple}} (because it has the same field value set for L and a different field value set for V in which "heploc" precedes "simple" at the first position of difference).
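The three steps above can be sketched as a single sort key in Python. This is a rough illustration of my proposed wording, not an implementation taken from the spec or from any CLDR tooling. The dict-of-sets rule representation and the names FIELDS and sort_key are hypothetical, and I am assuming that each field's subtags are compared in sorted, lowercased order.

```python
# Hypothetical representation: a rule is a dict mapping a field name
# ("L", "S", "R", "V") to a set of subtags; a missing key means an empty set.
FIELDS = ("L", "S", "R", "V")

def sort_key(rule):
    """Build a tuple whose natural ordering mirrors the three comparison steps."""
    # Input for step 3: each field's subtags, lowercased and sorted (assumption).
    # Lowercase ASCII strings compare with digits before letters.
    values = [sorted(tag.lower() for tag in rule.get(field, ())) for field in FIELDS]
    # Step 1: a larger total item count sorts first, hence the negation.
    total = sum(len(v) for v in values)
    # Step 2: False < True, so a non-empty field sorts before an empty one,
    # and list comparison stops at the first field that differs.
    empties = [len(v) == 0 for v in values]
    # Step 3: nested list comparison stops at the first differing subtag
    # of the first differing field.
    return (-total, empties, values)

rules = [
    {"R": {"CA"}},
    {"V": {"hepburn", "heploc"}},
    {"L": {"zh"}, "V": {"hepburn", "heploc"}},
    {"L": {"en"}, "R": {"GB"}},
    {"L": {"zh"}, "V": {"1996", "pinyin"}},
    {"L": {"ja"}, "V": {"hepburn", "heploc"}},
]
ordered = sorted(rules, key=sort_key)
```

One caveat: when one subtag sequence is a prefix of the other, Python orders the shorter sequence first. The wording above does not address that case, so treat that behavior as an artifact of the sketch.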

So using the examples above, we get the following order (where the cell in a 5.2 or 5.3 column compares a row to its predecessor, and a blank cell indicates differentiation by an earlier step):

| languageId | 5.1 total field value set item count | 5.2 non-empty field value set | 5.3 field value set items |
| --- | --- | --- | --- |
| {L={en}, S={Latn}, R={GB}} | 3 | n/a | n/a |
| {L={zh}, S={Hant}, R={CN}} | 3 | match (L, S, R) | in L, “en” before “zh” |
| {L={zh}, S={Hans}, V={pinyin}} | 3 | (L, S, R, …) before (L, S, V) | |
| {L={en}, R={GB}, V={scouse}} | 3 | (L, S, …) before (L, R, …) | |
| {L={ja}, V={hepburn,heploc}} | 3 | (L, R, …) before (L, V) | |
| {L={zh}, V={1996,pinyin}} | 3 | match (L, V) | in L, “ja” before “zh” |
| {L={zh}, V={hepburn,heploc}} | 3 | match (L, V) | in V, “1996” before “hepburn” |
| {L={zh}, V={hepburn,simple}} | 3 | match (L, V) | in V, “heploc” before “simple” |
| {V={fonipa,hepburn,heploc}} | 3 | (L, …) before (V) | |
| {V={hepburn,heploc,simple}} | 3 | match (V) | in V, “fonipa” before “hepburn” |
| {L={en}, R={GB}} | 2 | | |
| {V={hepburn,heploc}} | 2 | (L, …) before (V) | |
| {R={CA}} | 1 | | |

Activity

Annemarie Apple 🍎 January 30, 2025 at 3:23 PM

Closing as fixed since the spec was updated as planned in this ticket in CLDR 45.

Mark Davis December 5, 2023 at 6:59 PM

Unfortunately, the fixVersion wasn’t added, so we missed this in v44. Just set to 45

Mark Davis August 28, 2023 at 5:31 PM

Accepted in CLDR Design meeting, 2023-8-28

Mark Davis August 19, 2023 at 10:13 PM

I read through the revised text, and that is the correct interpretation. I recommend making changes along those lines.

Fixed

Created August 16, 2023 at 10:17 PM
Updated January 30, 2025 at 3:23 PM
Resolved January 30, 2025 at 3:23 PM
