Disambiguate languages with multiple scripts used by large populations in different countries
UnicodeBot last month
🛬 Merged PR
@conradarcturus merged a PR to unicode-org/cldr:main
CLDR-18114 Add explicit script markers in population data (#4463)
Mark Davis March 19, 2025 at 4:48 PM
Also add to release notes.

Annemarie Apple 🍎 March 6, 2025 at 7:11 PM
This looks like it should be in 48 not 47.

Night February 12, 2025 at 4:19 PM
Chinese (Malaysia) is still in use. Major should be Simplified, but Traditional is also in use, so it can be moved to Other Countries.
Currently pending

Annemarie Apple 🍎 November 20, 2024 at 5:30 PM
CLDR TC Decision 2024-11-20:
Agree with the principle that there should be an explicit primary script for each language-territory combination, for any language that has multiple primary scripts.
Next steps: Make recommendations for assigning these for the TC review.
While we fix the ambiguity of languages with multiple scripts ( ), there are some more sensitive cases. Some languages are used by large populations in multiple countries, and we want to be careful about how we word primary scripts.
Tasks:
Resolve ambiguity:
Enforce that all of these variations have population data AND that the input population data is not minimized, e.g. “az_Latn” for the AZ entry and “az_Arab” for the IR entry.
Output data to supplementalData.xml will remain minimized for now. likelySubtags will be the source of truth for the default script.
Enforce consistency:
Make sure only these languages have multiple primaries in language_script.tsv. Languages currently listed with multiple primaries: ks
Update tests and documentation.
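The "not minimized" requirement in the tasks above could be checked along these lines. This is a minimal sketch, not the CLDR tooling: the function name `find_ambiguous_entries` and the (territory, locale ID) input shape are assumptions made for illustration, and the language set is copied from the table in this ticket.

```python
# Languages listed in this ticket's table as having more than one primary script.
MULTI_SCRIPT_LANGUAGES = {"az", "hak", "ku", "mey", "nan", "pa", "yue", "zh"}


def find_ambiguous_entries(entries):
    """Return the (territory, locale_id) pairs that lack an explicit script.

    `entries` is a hypothetical in-memory form of the input population data:
    an iterable of (territory, locale_id) pairs such as ("AZ", "az_Latn").
    Only languages in MULTI_SCRIPT_LANGUAGES are checked.
    """
    ambiguous = []
    for territory, locale_id in entries:
        subtags = locale_id.split("_")
        language = subtags[0]
        if language not in MULTI_SCRIPT_LANGUAGES:
            continue
        # A script subtag is four letters, e.g. "Latn" or "Arab".
        has_script = any(len(s) == 4 and s.isalpha() for s in subtags[1:])
        if not has_script:
            ambiguous.append((territory, locale_id))
    return ambiguous
```

For example, an input containing `("IR", "az")` would be reported, while `("AZ", "az_Latn")` and entries for single-script languages would pass.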
Open question: Should we stop completely minimizing locales in the generated XML population data?
e.g. “zh” = Simplified Chinese (Mandarin) but “zh_Hant” = Traditional Chinese (Mandarin)
I’d prefer to keep zh_Hans and zh_Hant here too.
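To make the open question concrete, here is a rough sketch of what script minimization does to these IDs. It is an illustration under stated assumptions, not the CLDR generator: `DEFAULT_SCRIPT` is a tiny hand-written stand-in for the default scripts that likelySubtags provides, `minimize` is a hypothetical helper, and real minimization also takes the region into account.

```python
# Hand-written stand-in for default scripts derived from likelySubtags
# (assumption: only the two languages needed for this example are listed).
DEFAULT_SCRIPT = {"zh": "Hans", "az": "Latn"}


def minimize(locale_id):
    """Drop the script subtag when it equals the language's default script."""
    parts = locale_id.split("_")
    if len(parts) >= 2 and DEFAULT_SCRIPT.get(parts[0]) == parts[1]:
        return "_".join([parts[0]] + parts[2:])
    return locale_id


print(minimize("zh_Hans"))  # zh
print(minimize("zh_Hant"))  # zh_Hant
```

With minimization on, the Simplified entry loses its script while the Traditional entry keeps it, which is the asymmetry the question is about. Keeping both zh_Hans and zh_Hant would amount to skipping this step for the multi-script languages.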
| Language | Script | Major Countries | Other countries |
| Azerbaijani [az] | Latn | AZ | AM, TR |
| | Arab | IR | IQ, TR |
| | Cyrl | RU | AZ |
| Hakka Chinese [hak] | Hans | CN | |
| | Hant | TW | |
| Kurdish [ku] | Latn | TR, SY | DE |
| | Arab | IQ, IR | |
| | Cyrl | AZ, AM, GE, TM | |
| Hassaniyya [mey] | Arab | MR | All others |
| | Latn | SN | |
| Min Nan Chinese [nan] | Hans | CN | |
| | Hant | TW | MO |
| Punjabi [pa] | Arab | PK | |
| | Guru | IN | CA, GB, KE, SG |
| Cantonese [yue] | Hant | HK, TW, MO | CA, CN |
| | Hans | CN | |
| Chinese [zh] | Hans | CN | HK, MO, MY, SG |
| | Hant | HK, TW, MO | CA, MN |
| | Bopo | | |
| | Latn | | |
| | Phag | | |
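The TC principle of one explicit primary script per language-territory combination can be expressed as a uniqueness check over the "Major Countries" column. The sketch below is only illustrative: `ROWS` is a hand-copied subset of the table above, and `primary_scripts` and `conflicts` are hypothetical helper names, not part of any CLDR tool.

```python
from collections import defaultdict

# (language, script, major territories), copied by hand from a few table rows.
ROWS = [
    ("az", "Latn", ["AZ"]),
    ("az", "Arab", ["IR"]),
    ("az", "Cyrl", ["RU"]),
    ("pa", "Arab", ["PK"]),
    ("pa", "Guru", ["IN"]),
    ("zh", "Hans", ["CN"]),
    ("zh", "Hant", ["HK", "TW", "MO"]),
]


def primary_scripts(rows):
    """Collect the set of major scripts for each (language, territory) pair."""
    scripts = defaultdict(set)
    for language, script, territories in rows:
        for territory in territories:
            scripts[(language, territory)].add(script)
    return scripts


def conflicts(rows):
    """Return the pairs that have more than one major script, with the scripts sorted."""
    return {
        key: sorted(value)
        for key, value in primary_scripts(rows).items()
        if len(value) > 1
    }
```

On the subset above `conflicts(ROWS)` is empty. Adding a row such as `("zh", "Hans", ["HK"])` would flag `("zh", "HK")` with both `Hans` and `Hant`, which is the kind of ambiguity the ticket aims to remove.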
Notes:
Punjabi: Most countries are pa_Guru because the Punjabi population is largely Sikh
Chinese
Most large migrant communities left before Simplified Chinese was adopted
But there has been considerable migration since then, so the true value is not really known.
Singapore changed to Simplified many years ago
Malaysia was confusingly written as “Chinese (Traditional) … zh”, but I confirmed that major usage in Malaysia has changed to Simplified. Other countries currently written with Chinese (Traditional) may need to be changed, or at least the usage of both scripts acknowledged.
Kurdish (technically Northern Kurdish, Kurmanji)
The Hawar alphabet is widely used in Syria
Cyrillic was common for Kurdish in the Soviet Union, but there is no current source on whether that has changed in those countries