languageMatch data: language fallback distances too small

Description

The supplemental/languageInfo.xml languageMatch data includes language and language-script fallbacks (oneway="true" entries) for several types of mismatches. Some of the language fallbacks are between languages that are at most loosely related linguistically. These fallbacks seem to be based on where speakers of the desired language typically live, on the assumption that many of them also understand the dominant language of that region.

For example:

  • br=Breton→French or eu=Basque→Spanish (minority→majority language)

  • uz=Uzbek→Russian (former minority)

  • rw=Kinyarwanda→French (former colony)

  • sq=Albanian→English (??)

All of these language fallbacks have a languageMatch distance of 10, which is merely twice as big as some region distances.

Compare this with zh_Hans→zh_Hant (distance=15) or zh_Hant→zh_Hans (distance=19) and note that distance deltas from language/script/region mismatches are added up. Also, the ICU LocaleMatcher uses the en-GB→en-US distance (5) as the default demotion per desired language.

Example: Assume desired=(sq, zh-Hant) & supported=(zh-Hans, en). The "best match" is English because sq-Latn-AL→en-Latn-US has a distance of 10+4=14 while zh-Hant-TW→zh-Hans-CN has distance=19+5=24. I would argue that the match between the two versions of Chinese should win.
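The arithmetic in this example can be checked with a short sketch. It only sums the numbers quoted above and picks the smallest total. The helper names `total_distance` and `best_match` are hypothetical and are not part of the ICU LocaleMatcher API. The second comparison assumes the sq→en fallback distance is raised to 30, the value proposed further down.

```python
# Sketch only: sums the distance components quoted in this ticket.
# total_distance and best_match are hypothetical helpers, not ICU APIs.

def total_distance(base, extra):
    """Return the summed distance for one desired->supported pair."""
    return base + extra

def best_match(candidates):
    """Return the candidate name with the smallest total distance."""
    return min(candidates, key=candidates.get)

# Current data: the sq->en fallback is 10, plus 4 as given in the example.
current = {
    "sq->en": total_distance(10, 4),
    "zh-Hant->zh-Hans": total_distance(19, 5),
}

# Same pairs, assuming the sq->en fallback distance is raised to 30.
proposed = {
    "sq->en": total_distance(30, 4),
    "zh-Hant->zh-Hans": total_distance(19, 5),
}

print(best_match(current))   # sq->en
print(best_match(proposed))  # zh-Hant->zh-Hans
```

With the current data English wins with 14 against 24; with a fallback distance of 30 the totals are 34 against 24 and the Chinese match wins.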

Proposal:

  • Collect fallbacks between linguistically unrelated (or only loosely related) languages into their own section in languageInfo.xml.

  • Increase their distances.

What should the distance be?

  • It should be greater than zh_Hant→zh_Hans (19) so that a fallback does not beat out a Chinese script mismatch, even after one or maybe two per-desired-language demotions (5 each).

  • It should be smaller than 50, which is both the default script distance and the default match threshold.

  • It should be far enough below 50 that a fallback still works after a couple of per-desired-language demotions (5 each, also considering region distances of 4 or 5).

I propose a language fallback distance of 30.
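The three criteria above can be written as a small check. This is a sketch of one reading of them, not ICU code: it assumes demotions and region distances are simply added, uses two demotions and the larger region distance (5) as the worst case, and all names are hypothetical.

```python
# Sketch only: checks a candidate fallback distance against the criteria
# listed in this ticket. All names here are hypothetical.

ZH_HANT_TO_HANS = 19   # script mismatch distance quoted above
DEMOTION = 5           # per-desired-language demotion (en-GB->en-US distance)
MAX_REGION = 5         # region distances are 4 or 5
THRESHOLD = 50         # default script distance = default match threshold

def check(fallback, demotions=2):
    """Return (loses_to_chinese, still_matches) for a fallback distance."""
    # The Chinese script mismatch, even after `demotions` demotions,
    # should still have a smaller distance than the fallback.
    loses_to_chinese = fallback > ZH_HANT_TO_HANS + demotions * DEMOTION
    # The fallback, after the same demotions plus a region distance,
    # should still stay below the match threshold.
    still_matches = fallback + demotions * DEMOTION + MAX_REGION < THRESHOLD
    return loses_to_chinese, still_matches

print(check(10))  # (False, True)
print(check(30))  # (True, True)
print(check(40))  # (True, False)
```

Under these assumptions, 30 sits just above 19+5+5=29 and leaves 30+5+5+5=45 below the threshold of 50.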

Maybe some distances should be smaller (Breton→French) or greater (Albanian→English, what is that even based on??), but a medium distance should generally be an improvement.

Also, we have languageMatch entries for language fallbacks that also cross scripts. For example, uz→ru (currently 10) plus uz-Latn→ru-Cyrl (also 10). Plus there is usually also a region distance (4 or 5). When we increase the language fallback distances, we may(?!) want to reduce the corresponding language-script distances, so that the total does not get too close to the default threshold.
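To illustrate the concern, the following sketch adds up the uz→ru components as described in this paragraph, assuming the language, language-script and region distances are simply summed. The reduced language-script distance of 5 is a made-up value for illustration only, not a proposal, and the function name is hypothetical.

```python
# Sketch only: totals for uz-Latn -> ru-Cyrl under the additive model
# described in this ticket. Names and the "adjusted" value are hypothetical.

THRESHOLD = 50  # default match threshold quoted in this ticket

def total(language, language_script, region):
    """Sum the three distance components for one match."""
    return language + language_script + region

today = total(10, 10, 5)     # current data
raised = total(30, 10, 5)    # language fallback raised to 30
adjusted = total(30, 5, 5)   # hypothetical: language-script reduced to 5

for label, value in [("today", today), ("raised", raised), ("adjusted", adjusted)]:
    print(label, value, "margin below threshold:", THRESHOLD - value)
```

With the fallback raised to 30 and nothing else changed, the total is 45, so a single per-desired-language demotion of 5 would reach the threshold.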

PS: I am not proposing to change fallback distances between languages that are linguistically related, such as the auto-generated entries for languages encompassed by macrolanguages (e.g., arz→ar, yue→zh), or related languages like gsw→de, da→no.

xpath

None

locale

None

Priority

major

Assignee

Markus Scherer

Reporter

Markus Scherer

Reviewer

Mark Davis

Labels

None

Components

Fix versions

phase

rc