reimplement acceptLanguage() using the LocaleMatcher

Description

We have an old API for locale matching: ULocale.acceptLanguage() in Java and uloc_acceptLanguage() in C.
"Based on a list of available locales, determine an acceptable locale for the user."

This is what the newer LocaleMatcher does, but the LocaleMatcher uses a newer, more sophisticated algorithm backed by CLDR locale-distance data. (Plus it is much more efficient.)

I propose that we replace the old acceptLanguage() implementation code with a thin wrapper over LocaleMatcher. We would get better behavior and remove redundant functionality.


The acceptLanguage() functions optionally return a bit of information to distinguish between an exact match and a "fallback" (e.g., matching supported "ja" with desired "ja_JP"). The simplest approach would be to compare the best-match return value with the corresponding supported locale. The LocaleMatcher does look for exact matches, but of LSRs (language-script-region triples), not of full locales.
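
To make the exact-versus-fallback distinction concrete, here is a minimal sketch using only java.util.Locale. It is not ICU's acceptLanguage() or LocaleMatcher algorithm: the class AcceptResultSketch, the Kind enum, and the truncation-based stand-in matcher are all hypothetical, and the sketch assumes the comparison is made between the chosen supported locale and the desired locale that selected it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy illustration only; NOT ICU's acceptLanguage() or LocaleMatcher algorithm. */
final class AcceptResultSketch {
    /** Hypothetical result kinds for an exact match versus a fallback match. */
    enum Kind { FAILED, EXACT, FALLBACK }

    /** Hypothetical holder for the chosen supported locale and how it was reached. */
    static final class Match {
        final Locale supported;  // null when nothing matched
        final Kind kind;
        Match(Locale supported, Kind kind) {
            this.supported = supported;
            this.kind = kind;
        }
    }

    /** Stand-in matcher: tries each desired tag, then drops trailing subtags one by one. */
    static Match bestMatch(List<Locale> desired, List<Locale> supported) {
        for (Locale want : desired) {
            String tag = want.toLanguageTag();
            while (!tag.isEmpty()) {
                Locale candidate = Locale.forLanguageTag(tag);
                if (supported.contains(candidate)) {
                    // Exact only if the supported locale equals the desired one as given.
                    Kind kind = candidate.equals(want) ? Kind.EXACT : Kind.FALLBACK;
                    return new Match(candidate, kind);
                }
                int dash = tag.lastIndexOf('-');
                tag = dash < 0 ? "" : tag.substring(0, dash);
            }
        }
        return new Match(null, Kind.FAILED);
    }

    public static void main(String[] args) {
        List<Locale> supported = Arrays.asList(
                Locale.forLanguageTag("en"), Locale.forLanguageTag("ja"));
        Match m = bestMatch(Arrays.asList(Locale.forLanguageTag("ja-JP")), supported);
        System.out.println(m.supported + " " + m.kind);
    }
}
```

With a real LocaleMatcher the best match may come from distance data rather than truncation, so the same equality check would report "fallback" for any non-identical pair, however close.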


There are variants of the acceptLanguage() functions (a Java overload and C uloc_acceptLanguageFromHTTP()) that take an HTTP Accept-Language string for the desired locales. They have their own parsers that don't (at least in C++) quite seem to adhere to the spec. In Java, we have the public class LocalePriorityList, which the LocaleMatcher uses. In C++, I just wrote an internal version of that for the LocaleMatcher port.

I propose that we use LocalePriorityList rather than another implementation specific to acceptLanguage().


In Java, I have not modified the LocalePriorityList behavior. It throws exceptions for some syntax errors but does not validate language tags. The spec requires dash-separated subtags of at most 8 alphanumeric characters. In C++, I validate them using LocaleBuilder::setLanguageTag().
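
For illustration, a strict Accept-Language parser along these lines might look like the sketch below. It is not LocalePriorityList or the C++ port: the class AcceptLanguageSketch, its Entry type, and the regular expression are assumptions made for this example, and the q-value handling is deliberately simplified (it relies on Double.parseDouble, which accepts more number formats than HTTP allows).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical strict Accept-Language parser; not the ICU implementation. */
final class AcceptLanguageSketch {
    // "*" or dash-separated subtags of 1..8 characters: letters in the first
    // subtag, alphanumerics in the rest.
    private static final Pattern RANGE =
            Pattern.compile("\\*|[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*");

    static final class Entry {
        final String range;
        final double weight;
        Entry(String range, double weight) {
            this.range = range;
            this.weight = weight;
        }
    }

    static List<Entry> parse(String header) {
        List<Entry> entries = new ArrayList<>();
        for (String rawItem : header.split(",")) {
            String item = rawItem.trim();
            if (item.isEmpty()) {
                continue;  // tolerate empty list elements
            }
            String[] parts = item.split(";");
            String range = parts[0].trim();
            if (!RANGE.matcher(range).matches()) {
                throw new IllegalArgumentException("ill-formed language range: " + range);
            }
            double weight = 1.0;  // default weight when no q parameter is given
            if (parts.length > 2) {
                throw new IllegalArgumentException("unexpected parameters: " + item);
            }
            if (parts.length == 2) {
                String param = parts[1].trim();
                if (!param.startsWith("q=")) {
                    throw new IllegalArgumentException("expected q=weight: " + item);
                }
                weight = Double.parseDouble(param.substring(2).trim());
                if (!(weight >= 0.0 && weight <= 1.0)) {
                    throw new IllegalArgumentException("weight out of range: " + item);
                }
            }
            entries.add(new Entry(range, weight));
        }
        // Stable sort: higher weights first, header order kept among equal weights.
        entries.sort(Comparator.comparingDouble((Entry e) -> e.weight).reversed());
        return entries;
    }

    public static void main(String[] args) {
        for (Entry e : parse("da, en-gb;q=0.8, en;q=0.9")) {
            System.out.println(e.range + " q=" + e.weight);
        }
    }
}
```

The sketch keeps entries with a weight of 0 and leaves it to the caller to decide whether to drop them.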

We could decide to have Java and C++ Accept-Language parsing have similar or different strictness.

We could decide to accept

  • only well-formed language tags (LocaleBuilder::setLanguageTag()), or

  • language tags as far as possible (Locale::forLanguageTag()), or

  • anything including ICU legacy locale ID strings (Locale(string) constructor).
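
The difference between the first two options can be seen with the java.util counterparts of these calls, java.util.Locale.Builder.setLanguageTag() and java.util.Locale.forLanguageTag(). The sketch below uses those standard-library methods as stand-ins for the ICU C++ functions, on the assumption that they behave analogously here; the class TagLeniencySketch and its method names are hypothetical. The third option (legacy locale ID strings) is not covered.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

/** Hypothetical comparison of strict and lenient tag handling via java.util.Locale. */
final class TagLeniencySketch {
    /** Strict: throws IllformedLocaleException if any subtag is ill-formed. */
    static Locale parseStrict(String tag) {
        return new Locale.Builder().setLanguageTag(tag).build();
    }

    /** Lenient: the first ill-formed subtag and everything after it are ignored. */
    static Locale parseLenient(String tag) {
        return Locale.forLanguageTag(tag);
    }

    public static void main(String[] args) {
        String tag = "en-US-abcdefghi";  // last subtag has 9 characters, so it is ill-formed
        System.out.println("lenient: " + parseLenient(tag).toLanguageTag());
        try {
            System.out.println("strict: " + parseStrict(tag).toLanguageTag());
        } catch (IllformedLocaleException e) {
            System.out.println("strict: rejected (" + e.getMessage() + ")");
        }
    }
}
```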

Activity

Mark Davis
July 14, 2019, 9:19 PM

On leniency, I’d suggest an enum with 2 values:

  1. well-formed only

  2. lenient (anything accepted by Locale::forLanguageTag() or the Locale(string) constructor)

 

Markus Scherer
July 16, 2019, 12:00 AM

Ok. What should be the default? Strict or lenient?

Markus Scherer
July 24, 2019, 3:09 AM

I changed my C++ LocalePriorityList code to also use the Locale constructor, so there is no language tag validation. This was needed to make the LocaleMatcher data-driven test work.

Nathan Hammond
February 20, 2020, 9:05 AM

This is a much-needed change in order to make {{acceptLanguage}} viable to use. Given an {{Accept-Language}} header of {{"zh-HK"}}, the calculation in {{acceptLanguage}} will return {{zh}}: Mandarin in simplified characters. However, the ideal selection that is relatively likely to be available would probably be {{zh-TW}}, which will use Min in traditional characters and is more similar to Yue (Cantonese).

Nathan Hammond
March 10, 2020, 5:07 PM

I left a slightly late review on the PR; my apologies for not getting to it before it landed.

https://github.com/unicode-org/icu/pull/1022#pullrequestreview-372118729

Assignee

Markus Scherer

Reporter

Markus Scherer

Components

Labels

None

Reviewer

None

Priority

major

Time Needed

Days

Fix versions
