Add Windows locale and collation aliases to icu-locale-deprecates.xml and icu-coll-deprecates.xml

Description

Windows adds a few aliases to the icu-locale-deprecates.xml and icu-coll-deprecates.xml files to smooth things out between Windows locales and CLDR locales.

I would like to add the following aliases to ICU.

icu-locale-deprecates.xml:
<alias from="ccp_Cakm_BD" to="ccp_BD"/>
<alias from="ccp_Cakm_IN" to="ccp_IN"/>
<alias from="ceb_Latn_PH" to="ceb_PH"/>
<alias from="iu_Cans_CA" to="iu_CA"/>
<alias from="jv_Latn_ID" to="jv_ID"/>
<alias from="mi_Latn_NZ" to="mi_NZ"/>
<alias from="quz" to="qu"/>
<alias from="quz_BO" to="qu_BO"/>
<alias from="quz_EC" to="qu_EC"/>
<alias from="quz_PE" to="qu_PE"/>

icu-coll-deprecates.xml:
<alias from="ccp_Cakm_BD" to="ccp_BD"/>
<alias from="ccp_Cakm_IN" to="ccp_IN"/>
<alias from="ceb_Latn_PH" to="ceb_PH"/>
<alias from="ff_CM" to="ff_Latn_CM"/>
<alias from="ff_GN" to="ff_Latn_GN"/>
<alias from="ff_MR" to="ff_Latn_MR"/>
<alias from="ff_SN" to="ff_Latn_SN"/>
<alias from="iu_Cans_CA" to="iu_CA"/>
<alias from="jv_Latn_ID" to="jv_ID"/>
<alias from="mi_Latn_NZ" to="mi_NZ"/>
<alias from="quz" to="qu"/>
<alias from="quz_BO" to="qu_BO"/>
<alias from="quz_EC" to="qu_EC"/>
<alias from="quz_PE" to="qu_PE"/>
<alias from="sr_Cyrl_CS" to="sr_Cyrl_RS"/>
<alias from="sr_Latn_CS" to="sr_Latn_RS"/>
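For illustration only, here is a minimal Python sketch of how alias entries of this shape could be read and applied as a simple from/to lookup. It is not how ICU consumes these files at build time. The `<aliases>` wrapper element and the helper names `load_aliases` and `resolve_alias` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# A few of the alias entries proposed above, wrapped in a hypothetical
# <aliases> root element so the fragment parses as a single XML document.
ALIAS_XML = """
<aliases>
  <alias from="ccp_Cakm_BD" to="ccp_BD"/>
  <alias from="iu_Cans_CA" to="iu_CA"/>
  <alias from="quz" to="qu"/>
  <alias from="quz_PE" to="qu_PE"/>
  <alias from="sr_Cyrl_CS" to="sr_Cyrl_RS"/>
</aliases>
"""


def load_aliases(xml_text):
    """Build a {from: to} dictionary from <alias from=... to=.../> elements."""
    root = ET.fromstring(xml_text)
    return {a.get("from"): a.get("to") for a in root.iter("alias")}


def resolve_alias(locale_id, aliases):
    """Return the alias target if one exists, otherwise the ID unchanged."""
    return aliases.get(locale_id, locale_id)


ALIASES = load_aliases(ALIAS_XML)
print(resolve_alias("quz_PE", ALIASES))  # qu_PE
print(resolve_alias("en_US", ALIASES))   # en_US
```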

Activity

Jeff Genovy
January 29, 2020, 6:57 PM

Adding in some notes from the discussion in the ICU-TC on 2020-01-29:

Markus: For collation, many of these shouldn’t be needed. Lookup should fall back (by truncating from the right) to the base tag, assuming that collation data actually exists for it. For example, ccp_Cakm_BD should eventually try ccp (if there is data).

Steven: Though for quz, that’s a macro tag, so the mapping from quz to qu might not happen automatically.

Markus: Right, but we don’t want to have mappings for every single macro language tag that exists, there are too many.

Peter: We add a macro tag mapping for yue though, due to the usage.

Steven: For macro tags (or others) that are platform specific, they might want to be handled in a layer above the ICU API(s).

Jeff: I think the entry for iu is there because we pick up other locale(s) from CLDR seed that use different scripts (e.g., Latn).

Peter: For things like iu_CA, if you have more than one script you might also want to check/change things in the Likely Subtags as well.
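The truncation fallback that Markus describes, and the quz gap that Steven points out, can be sketched roughly as follows. This is a simplified Python illustration, not ICU's actual resource lookup. The `AVAILABLE` set is a hypothetical stand-in for whatever collation data is actually present, and the function names are made up for this example.

```python
def fallback_chain(locale_id):
    """List the ID, then each parent made by dropping the last subtag, then root."""
    parts = locale_id.split("_")
    chain = ["_".join(parts[:i]) for i in range(len(parts), 0, -1)]
    chain.append("root")
    return chain


def find_collation(locale_id, available):
    """Return the first entry of the fallback chain that has data, if any."""
    for candidate in fallback_chain(locale_id):
        if candidate in available:
            return candidate
    return None


# Hypothetical data set: assume tailorings exist for ccp and qu only.
AVAILABLE = {"root", "ccp", "qu"}

print(fallback_chain("ccp_Cakm_BD"))             # ['ccp_Cakm_BD', 'ccp_Cakm', 'ccp', 'root']
print(find_collation("ccp_Cakm_BD", AVAILABLE))  # ccp
print(find_collation("quz_PE", AVAILABLE))       # root
```

Under these assumptions ccp_Cakm_BD reaches ccp without any alias, while quz_PE never reaches qu because truncation alone cannot turn quz into qu.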

 

Regarding the addition of scripts in the tag:

Jeff: In the past CLDR would omit the script from the locale name if there was only one locale within CLDR main. However, with more and more locales being added, there is perhaps a desire to add the script now.

For Microsoft, we have other language tag parsers that are very strict w.r.t. the IANA Subtag Registry and IANA’s notion of a Suppress-Script. If a language doesn’t have a suppress-script registered in IANA, then the parser expects to see an explicit script subtag in the tag.

(Note: en does have a suppress-script of Latn, so en-US is valid. If you wanted Dsrt, then you’d need to be explicit and have en-Dsrt-US).
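A rough Python sketch of the kind of strict check described here follows. It is not Microsoft's parser. The `SUPPRESS_SCRIPT` table is an illustrative excerpt holding only the en entry from the note above, so any language missing from it is treated as having no suppress-script purely as an assumption of the sketch. It also assumes simple language-script-region tags with no extlang subtags.

```python
# Illustrative excerpt: only the "en" entry comes from the note above.
# A real implementation would load the full IANA Language Subtag Registry.
SUPPRESS_SCRIPT = {"en": "Latn"}


def is_script_subtag(subtag):
    """Script subtags are exactly four ASCII letters."""
    return len(subtag) == 4 and subtag.isascii() and subtag.isalpha()


def strict_script_check(tag):
    """Accept a tag if it names a script or its language has a suppress-script.

    Simplification: the script, if present, is assumed to be the second subtag.
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    if len(subtags) > 1 and is_script_subtag(subtags[1]):
        return True
    return language in SUPPRESS_SCRIPT


print(strict_script_check("en-US"))       # True
print(strict_script_check("en-Dsrt-US"))  # True
print(strict_script_check("iu-CA"))       # False
print(strict_script_check("iu-Cans-CA"))  # True
```

The iu-CA result reflects only the excerpt table used here; it shows why a strict parser of this kind would want iu_Cans_CA spelled out.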

Peter: I would be in favor of including the script when there is more than one possible script in usage (even if the other script doesn’t have a locale in CLDR).

Jeff: We use IANA to decide the suppress-script.

Peter: That sounds like an idea/possibility for CLDR as well.

Markus/Peter: We should file another ticket for CLDR to have more discussion there about this.

Daniel offered to file a CLDR ticket and link it to this one.

Markus: Maybe Daniel can take this ticket for Design for Future.

 

Markus Scherer
February 6, 2020, 8:53 PM

Another comment about macrolanguages – for example, there are some 37 language subtags for “<something> Arabic” while most products only have translations for “ar”. I don’t think we want an alias file for each of the variants.

Some of this could be handled by canonicalizing a language tag / locale ID (especially after Frank’s bug fixes), at least for mapping the implied language to the macrolanguage, as in cmn->zh.

Another way to deal with this is to use the LocaleMatcher. CLDR recently added fallback mappings from many language variants to a commonly used language, such as from the various Arabics to “ar”. When there is data for the language variant, then that is preferred.
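The preference order described in this paragraph (use data for the language variant when it exists, otherwise fall back to the commonly used language) could look roughly like the Python sketch below. It does not use or reproduce the real LocaleMatcher API or CLDR's matching data. The `VARIANT_TO_COMMON` table is a hypothetical two-entry stand-in: cmn to zh is taken from the comment above, and arz to ar is just an example of an Arabic variant.

```python
# Hypothetical fallback table; the real CLDR data is much larger.
VARIANT_TO_COMMON = {"cmn": "zh", "arz": "ar"}


def pick_language(requested, supported):
    """Prefer data for the requested variant, then its common language."""
    if requested in supported:
        return requested
    common = VARIANT_TO_COMMON.get(requested)
    if common in supported:
        return common
    return None


print(pick_language("arz", {"ar", "en"}))         # ar
print(pick_language("arz", {"ar", "arz", "en"}))  # arz
print(pick_language("cmn", {"en"}))               # None
```

With a mapping like this in one place, a product that only ships "ar" would not need a separate alias entry for each Arabic variant.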

Also, for collation, if a language is commonly written in multiple scripts, then the tailoring for the base language should implement one sort order that covers all of the scripts. They don’t conflict with each other. (The Unicode default sort order may already be right for one or more of the scripts.)

This is also why we don’t (or shouldn’t) use special parent fallbacks in collation, like zh-Hant skipping zh. (We do have a coll/zh_Hant.txt file, but it just selects the stroke order that it inherits from zh.)

Assignee

Daniel Ju

Reporter

Daniel Ju

Components

Labels

None

Reviewer

None

Priority

assess

Time Needed

None

Fix versions
