Use the CLDR alias/replacement data in canonicalizing

Description

I added cases (on my machine) to ULocaleTest.TestCanonicalization to test some of the canonicalization. According to the CLDR data, the following should work:

(With a caveat: we have to decide whether the canonicalization of extlang should happen so that ULocale.getName(source) returns it, or whether it should happen only in ULocale.canonicalize(source).)

Anyway, the current test results are:

See http://unicode.org/cldr/trac/ticket/2787

Activity

Markus Scherer
September 26, 2018, 10:39 PM

Please don't grab a ticket from someone without discussion on the team list and/or in the team meeting.

It is industry practice, and Unicode CLDR standard, to interpret a macrolanguage code like "zh" as the most common language that it includes. That is, zh is Mandarin, and cmn is redundant. CLDR has mappings for these:

https://unicode.org/cldr/trac/browser/trunk/common/supplemental/supplementalMetadata.xml

For example,
<languageAlias type="cmn" replacement="zh" reason="macrolanguage"/>

I realize that this contradicts https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry, which lists:
Type: redundant
Tag: zh-cmn
Description: Mandarin Chinese
Added: 2005-07-15
Deprecated: 2009-07-29
Preferred-Value: cmn

but we do really want to follow Unicode CLDR on this.

It might be ok to add a one-off mapping if it matches the CLDR data, but we should really generate the whole mapping data from CLDR.
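To illustrate what generating the mapping from CLDR data could look like, here is a minimal Java sketch that reads `languageAlias` elements into a lookup table. It is not ICU's actual data-generation code: the class name `LanguageAliasLoader` and the method `load` are hypothetical, and the input is an inline snippet rather than the real supplementalMetadata.xml (which carries a DOCTYPE and many more elements).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of ICU or CLDR tooling.
final class LanguageAliasLoader {

    /** Reads every languageAlias element into a type -> replacement map. */
    static Map<String, String> load(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("languageAlias");
            Map<String, String> aliases = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element alias = (Element) nodes.item(i);
                aliases.put(alias.getAttribute("type"), alias.getAttribute("replacement"));
            }
            return aliases;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalStateException("Could not parse alias data", e);
        }
    }

    public static void main(String[] args) {
        // Inline sample containing only the element quoted in this comment.
        String xml = "<alias>"
                + "<languageAlias type=\"cmn\" replacement=\"zh\" reason=\"macrolanguage\"/>"
                + "</alias>";
        System.out.println(load(xml));
    }
}
```

A table built this way would track CLDR as its data changes, instead of requiring one-off mappings to be added by hand.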

Jungshik Shin
September 27, 2018, 7:51 AM

How about revising the IANA language subtag registry to match CLDR's handling of macrolanguages?

For EcmaScript Intl.Locale API, I filed to introduce a strict BCP 47/IANA registry compliance mode for uloc_forLanguageTag + uloc_toLanguageTag.

Frank, this issue wouldn't help your work on v8's implementation of Intl.Locale.

Markus Scherer
September 27, 2018, 9:02 PM

I have no idea if the IETF cares to follow Unicode language identifiers, but ICU does, and much of the industry too.

For example, zh = zh-cmn = cmn, and all of them need to canonicalize to zh. It is useless for tagging content to interpret zh as "Chinese languages", just as we don't tag content as "Germanic languages" and leave it open whether it's Danish or German.

It would be more productive to change the ECMAScript spec to refer to Unicode language identifiers rather than raw BCP 47.
http://www.unicode.org/reports/tr35/#Unicode_language_identifier
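The equivalence zh = zh-cmn = cmn can be sketched in Java as a prefix replacement on the language tag. This is only an illustration under stated assumptions: `MacrolanguageSketch`, `canonicalizeLanguage`, and the two-entry table are hypothetical, and the code is neither ICU's canonicalization algorithm nor the full CLDR alias data.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration, not ICU code.
final class MacrolanguageSketch {

    // Assumed table holding only the equivalences named in this comment.
    private static final Map<String, String> ALIASES = Map.of(
            "cmn", "zh",
            "zh-cmn", "zh");

    /** Replaces an aliased leading prefix of the tag; other tags are returned unchanged. */
    static String canonicalizeLanguage(String tag) {
        String[] subtags = tag.split("[-_]");
        // Try the two-subtag prefix first (e.g. "zh-cmn"), then the language subtag alone.
        for (int len = Math.min(2, subtags.length); len >= 1; len--) {
            String prefix = String.join("-", Arrays.copyOfRange(subtags, 0, len))
                    .toLowerCase(Locale.ROOT);
            String replacement = ALIASES.get(prefix);
            if (replacement != null) {
                StringBuilder out = new StringBuilder(replacement);
                // Keep the remaining subtags (script, region, ...) as they were.
                for (int i = len; i < subtags.length; i++) {
                    out.append('-').append(subtags[i]);
                }
                return out.toString();
            }
        }
        return tag;
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"zh", "cmn", "zh-cmn", "cmn-Hans-CN"}) {
            System.out.println(tag + " -> " + canonicalizeLanguage(tag));
        }
    }
}
```

With this table, `cmn`, `zh-cmn`, and `zh` all come out as `zh`, and `cmn-Hans-CN` becomes `zh-Hans-CN`.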

Markus Scherer
June 2, 2020, 10:17 PM

Can we declare this ticket done via Frank’s ICU 67 work? I know there are follow-up items, but the canonicalization does use CLDR data now. Right?

Frank Yung-Fong Tang
September 9, 2020, 6:22 PM

This is addressed in

Assignee: Frank Yung-Fong Tang
Reporter: Mark Davis
Components:
Labels: None
Reviewer: None
Priority: assess
Time Needed: Days
Fix versions: