I added cases (on my machine) to ULocaleTest.TestCanonicalization to test some of the canonicalization. According to the CLDR data, the following should work:
(With a caveat: we have to decide whether the canonicalization of extlang should happen so that ULocale.getName(source) returns it, or whether it should only happen in ULocale.canonicalize(source).)
Anyway, the current test results are:
Please don't grab a ticket from someone without discussion on the team list and/or in the team meeting.
It is industry practice, and the Unicode CLDR standard, to interpret a macrolanguage code like "zh" as the most common language that it includes. That is, zh is Mandarin, and cmn is redundant. CLDR has mappings for these:
<languageAlias type="cmn" replacement="zh" reason="macrolanguage"/>
I realize that this contradicts https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
Description: Mandarin Chinese
but we do really want to follow Unicode CLDR on this.
It might be ok to add a one-off mapping if it matches the CLDR data, but we should really generate the whole mapping data from CLDR.
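As a rough illustration of generating the mapping from CLDR data rather than hand-adding entries, here is a minimal Java sketch that reads languageAlias elements into a type-to-replacement map. The class name AliasLoader, the method loadLanguageAliases, and the wrapping <alias> element in the inline sample are assumptions made for this example; this is not how ICU's data build actually works, and a real generator would read the full CLDR supplemental data and also handle the reason attribute and multi-subtag replacements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only; not part of ICU.
public class AliasLoader {

    /** Maps each languageAlias "type" attribute to its "replacement" attribute. */
    static Map<String, String> loadLanguageAliases(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("languageAlias");
        Map<String, String> aliases = new LinkedHashMap<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element alias = (Element) nodes.item(i);
            aliases.put(alias.getAttribute("type"), alias.getAttribute("replacement"));
        }
        return aliases;
    }

    public static void main(String[] args) throws Exception {
        // Inline sample holding only the alias quoted in this ticket.
        String xml = "<alias>"
                + "<languageAlias type=\"cmn\" replacement=\"zh\" reason=\"macrolanguage\"/>"
                + "</alias>";
        System.out.println(loadLanguageAliases(xml));
    }
}
```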
How about revising the IANA language subtag registry to match CLDR in its handling of macrolanguages?
For the ECMAScript Intl.Locale API, I filed a request to introduce a strict BCP 47/IANA registry compliance mode for uloc_forLanguageTag + uloc_toLanguageTag.
Frank, this issue wouldn't help your work on v8's implementation of Intl.Locale.
I have no idea if the IETF cares to follow Unicode language identifiers, but ICU does, and so does much of the industry.
For example, zh = zh-cmn = cmn and needs to canonicalize to zh. It is useless for tagging content to interpret zh as "Chinese languages", just like we don't tag content as "Germanic languages" and leave it open whether it's Danish or German.
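To make the zh = zh-cmn = cmn equivalence concrete, here is a small self-contained Java sketch. It is not ICU code: the class MacrolanguageSketch, the method canonicalizeLanguage, and the one-entry alias table are hypothetical. It makes two simplifying assumptions: the input is a well-formed tag, and a three-letter alphabetic second subtag is an extlang that replaces the primary language before the alias lookup.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, for illustration only; not part of ICU.
public class MacrolanguageSketch {

    // Assumed one-entry table; real data would come from the CLDR languageAlias elements.
    static final Map<String, String> LANGUAGE_ALIASES = Map.of("cmn", "zh");

    static String canonicalizeLanguage(String tag) {
        String[] subtags = tag.split("-");
        String language = subtags[0].toLowerCase(Locale.ROOT);
        int next = 1;
        // Simplifying assumption: a 3-letter alphabetic second subtag is an extlang,
        // and it replaces the primary language (zh-cmn becomes cmn).
        if (subtags.length > 1 && subtags[1].matches("[A-Za-z]{3}")) {
            language = subtags[1].toLowerCase(Locale.ROOT);
            next = 2;
        }
        // Apply the macrolanguage alias (cmn becomes zh).
        language = LANGUAGE_ALIASES.getOrDefault(language, language);
        // Re-attach the remaining subtags (script, region, ...) unchanged.
        StringBuilder out = new StringBuilder(language);
        for (int i = next; i < subtags.length; i++) {
            out.append('-').append(subtags[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"zh", "zh-cmn", "cmn", "zh-cmn-Hant-TW"}) {
            System.out.println(tag + " -> " + canonicalizeLanguage(tag));
        }
    }
}
```

With this table, all three spellings of Mandarin collapse to zh, and any script or region subtags are carried through untouched.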
It would be more productive to change the ECMAScript spec to refer to Unicode language identifiers rather than raw BCP 47.
Can we declare this ticket done via Frank’s ICU 67 work? I know there are follow-up items, but the canonicalization does use CLDR data now. Right?
This is addressed in