Consistency issue between Locale canonicalize and UTS35

Description

Currently, icu::Locale::canonicalize consults the REDUNDANT array, so, for example,
"sgn-GR" is canonicalized to "gss", and "ja-latn-hepburn-heploc" to "ja-latn-alalc97".

These are entries in https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
with Type: redundant

In uloc_tag.cpp

/*
Updated on 2018-09-12 from
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry .

The table lists redundant tags with preferred value in the IANA language tag registry.
It's generated with the following command:

curl https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry |\
grep 'Type: redundant' -A 5 | egrep '^(Tag:|Prefer)' | grep -B1 'Preferred' | \
awk -n '/Tag/ {printf(" \"%s\", ", $2);} /Preferred/ {printf("\"%s\",\n", $2);}' | \
tr 'A-Z' 'a-z'

In addition, ja-latn-hepburn-heploc is mapped to ja-latn-alalc97 because
a variant tag 'hepburn-heploc' has the preferred subtag, 'alalc97'.
*/

static const char* const REDUNDANT[] = {
// redundant preferred
"sgn-br", "bzs",
"sgn-co", "csn",
"sgn-de", "gsg",
"sgn-dk", "dsl",
"sgn-es", "ssp",
"sgn-fr", "fsl",
"sgn-gb", "bfi",
"sgn-gr", "gss",
"sgn-ie", "isg",
"sgn-it", "ise",
"sgn-jp", "jsl",
"sgn-mx", "mfs",
"sgn-ni", "ncs",
"sgn-nl", "dse",
"sgn-no", "nsl",
"sgn-pt", "psr",
"sgn-se", "swl",
"sgn-us", "ase",
"sgn-za", "sfs",
"zh-cmn", "cmn",
"zh-cmn-hans", "cmn-hans",
"zh-cmn-hant", "cmn-hant",
"zh-gan", "gan",
"zh-wuu", "wuu",
"zh-yue", "yue",

// variant tag with preferred value
"ja-latn-hepburn-heploc", "ja-latn-alalc97",
};

However, https://github.com/unicode-org/cldr/blob/master/common/supplemental/supplementalMetadata.xml does not contain these values. Therefore, because ECMA402 applies the algorithm, test262 expects that these values will not be canonicalized.

I am not sure where we should make the change. We should consider:
1. Change the ICU code to NOT consider the information in REDUNDANT, OR
2. Change CLDR and add entries to https://github.com/unicode-org/cldr/blob/master/common/supplemental/supplementalMetadata.xml to include this information, OR
3. Change CLDR to update UTS35 to clarify this.

Activity

Frank Yung-Fong Tang
May 20, 2020, 8:22 PM

file the sgn* part into

file the zh-(gan|wuu|yue) part into

Frank Yung-Fong Tang
May 20, 2020, 8:31 PM

file the zh-cmn* part into

Peter Edberg
June 22, 2020, 3:59 PM

It turns out that no additional data from CLDR or elsewhere is required to address this. It can be fixed completely algorithmically in ICU. See the comments at the end of

Frank Yung-Fong Tang
September 9, 2020, 6:01 PM

the fix is in

Frank Yung-Fong Tang
September 23, 2020, 6:34 PM

fix landed

Fixed by Other Ticket

Assignee

Frank Yung-Fong Tang

Reporter

Frank Yung-Fong Tang

Components

Labels

Reviewer

None

Priority

medium

Time Needed

None

Fix versions
