Disambiguate languages with multiple scripts used by large populations in different countries

Description

While we fix the ambiguity of languages with multiple scripts, there are some more sensitive cases. Some languages are used by large populations in multiple countries, and we want to be careful about how we word primary scripts.

Tasks:

  • Resolve ambiguity:

    • Enforce that all of these variations have population data AND that the input population data is not minimized, e.g. “az_Latn” for the AZ entry and “az_Arab” for the IR entry.

    • Output data to supplementalData.xml will remain minimized for now; likelySubtags will be the source of truth for the default script.

  • Enforce consistency:

    • Make sure only these languages have multiple primaries in language_script.tsv. Languages currently listed with multiple primaries: ks

    • Update tests and documentation.

  • Open question: Should we stop completely minimizing locales in the generated XML population data?

    • e.g. “zh” = Simplified Chinese (Mandarin) but “zh_Hant” = Traditional Chinese (Mandarin)

    • I’d prefer to keep zh_Hans and zh_Hant here too.
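The two enforcement tasks above could be checked along these lines. This is only a minimal sketch: the `(territory, locale)` pairs and the `(language, script, status)` rows are hypothetical, simplified views of the input population data and of language_script.tsv, not their real layouts, and the function names are invented for illustration. The language set is taken from the table below.

```python
import re

# Languages from the table below that have more than one primary script.
MULTI_SCRIPT_LANGUAGES = {"az", "hak", "ku", "mey", "nan", "pa", "yue", "zh"}

# Matches "lang" or "lang_Script"; a script subtag is four letters, e.g. "Latn".
_LOCALE_RE = re.compile(r"^(?P<lang>[a-z]{2,3})(?:_(?P<script>[A-Z][a-z]{3}))?$")


def find_unmarked_entries(entries):
    """Return (territory, locale) pairs that lack a required script subtag.

    `entries` is an assumed, simplified shape for the input population
    data: an iterable of (territory, locale) pairs such as ("AZ", "az_Latn").
    """
    problems = []
    for territory, locale in entries:
        match = _LOCALE_RE.match(locale)
        if match is None:
            # Not of the form lang or lang_Script; flag it for review.
            problems.append((territory, locale))
            continue
        if match["lang"] in MULTI_SCRIPT_LANGUAGES and match["script"] is None:
            problems.append((territory, locale))
    return problems


def languages_with_multiple_primaries(rows):
    """Return languages that have more than one script marked "primary".

    `rows` is an assumed, simplified shape for language_script.tsv:
    an iterable of (language, script, status) tuples.
    """
    primaries = {}
    for language, script, status in rows:
        if status == "primary":
            primaries.setdefault(language, set()).add(script)
    return {lang for lang, scripts in primaries.items() if len(scripts) > 1}
```

A test could then assert that `find_unmarked_entries(...)` returns an empty list for the input data, and that `languages_with_multiple_primaries(...) - MULTI_SCRIPT_LANGUAGES` contains only the exceptions that are already known.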

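| Language              | Script | Major Countries | Other Countries |
|-----------------------|--------|-----------------|-----------------|
| Azerbaijani [az]      | Latn   | AZ              | AM, TR          |
|                       | Arab   | IR              | IQ, TR          |
|                       | Cyrl   | RU              | AZ              |
| Hakka Chinese [hak]   | Hans   | CN              |                 |
|                       | Hant   | TW              |                 |
| Kurdish [ku]          | Latn   | TR, SY          | DE              |
|                       | Arab   | IQ, IR          |                 |
|                       | Cyrl   |                 | AZ, AM, GE, TM  |
| Hassaniyya [mey]      | Arab   | MR              | All others      |
|                       | Latn   | SN              |                 |
| Min Nan Chinese [nan] | Hans   | CN              |                 |
|                       | Hant   | TW              | MO              |
| Punjabi [pa]          | Arab   | PK              |                 |
|                       | Guru   | IN              | CA, GB, KE, SG  |
| Cantonese [yue]       | Hant   | HK, TW, MO      | CA, CN          |
|                       | Hans   | CN              |                 |
| Chinese [zh]          | Hans   | CN              | HK, MO, MY, SG  |
|                       | Hant   | HK, TW, MO      | CA, MN          |
|                       | Bopo   |                 |                 |
|                       | Latn   |                 |                 |
|                       | Phag   |                 |                 |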

Notes:

  • Punjabi: Most countries are listed as pa_Guru because the Punjabi population there is largely Sikh

  • Chinese

    • Most large migrant communities left before Simplified Chinese was adopted

      • Since then, however, there has been considerable migration, so the true value is not really known.

    • Singapore changed to Simplified many years ago

    • Malaysia was confusingly written “Chinese (Traditional) … zh”, but I confirmed that major usage in Malaysia has changed to Simplified. Other countries currently written with Chinese (Traditional) may need to be changed, or at least the usage of both scripts acknowledged.

  • Kurdish (technically Northern Kurdish, Kurmanji)

    • The Hawar (Latin) alphabet is widely used in Syria

    • Cyrillic was common for Kurdish in the Soviet Union, but there is no current source showing whether usage in those countries has changed
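To make the "explicit primary script per language and territory" idea concrete, the Major Countries column of the table above can be turned into a lookup. This is a sketch under assumptions: the data structure and function names are hypothetical, only the Major Countries column is modeled, and the Other Countries column (including the "All others" entry for Hassaniyya) is deliberately left out because some territories there overlap with major ones (for example HK and MO for zh).

```python
# (language, script) -> territories, copied from the "Major Countries" column.
MAJOR_COUNTRIES = {
    ("az", "Latn"): {"AZ"},
    ("az", "Arab"): {"IR"},
    ("az", "Cyrl"): {"RU"},
    ("hak", "Hans"): {"CN"},
    ("hak", "Hant"): {"TW"},
    ("ku", "Latn"): {"TR", "SY"},
    ("ku", "Arab"): {"IQ", "IR"},
    ("mey", "Arab"): {"MR"},
    ("mey", "Latn"): {"SN"},
    ("nan", "Hans"): {"CN"},
    ("nan", "Hant"): {"TW"},
    ("pa", "Arab"): {"PK"},
    ("pa", "Guru"): {"IN"},
    ("yue", "Hant"): {"HK", "TW", "MO"},
    ("yue", "Hans"): {"CN"},
    ("zh", "Hans"): {"CN"},
    ("zh", "Hant"): {"HK", "TW", "MO"},
}


def build_primary_script_index(major_countries):
    """Invert the table into (language, territory) -> script.

    Raises ValueError if a language/territory pair is claimed by two
    scripts, since a pair may have only one explicit primary script.
    """
    index = {}
    for (language, script), territories in major_countries.items():
        for territory in territories:
            key = (language, territory)
            if key in index:
                raise ValueError(
                    f"{language} in {territory} has two primary scripts: "
                    f"{index[key]} and {script}"
                )
            index[key] = script
    return index


def explicit_locale(language, territory, index):
    """Return e.g. "az_Arab_IR", or None if the pair is not a major entry."""
    script = index.get((language, territory))
    if script is None:
        return None
    return f"{language}_{script}_{territory}"


PRIMARY_SCRIPT = build_primary_script_index(MAJOR_COUNTRIES)
```

With this index, `explicit_locale("az", "IR", PRIMARY_SCRIPT)` gives `"az_Arab_IR"`, while `explicit_locale("zh", "MY", PRIMARY_SCRIPT)` gives `None` because Malaysia appears only under Other Countries. The duplicate check is the useful part: it fails loudly if a future edit to the table assigns two primary scripts to the same language and territory.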

Activity

UnicodeBot last month

🛬 Merged PR

@conradarcturus merged a PR to unicode-org/cldr:main

CLDR-18114 Add explicit script markers in population data (#4463)

Mark Davis March 19, 2025 at 4:48 PM

Also add to release notes.

Annemarie Apple 🍎 March 6, 2025 at 7:11 PM

This looks like it should be in 48 not 47.

Night February 12, 2025 at 4:19 PM

Traditional Chinese is still in use in Malaysia today. The major script should be Simplified, but because Traditional is also in use, it can be moved to Other Countries.

Currently pending

Annemarie Apple 🍎 November 20, 2024 at 5:30 PM

CLDR TC Decision 2024-11-20:

  • Agree with the principle that there should be an explicit primary script for each language-territory combination, for any language that has multiple primary scripts.

  • Next steps: Make recommendations for assigning these for the TC review.

Details

Merged

Created November 19, 2024 at 6:02 PM
Updated last month