LocaleMatcher able to reject one-way matches

Description

The LocaleMatcher uses CLDR languageMatch data which includes fallback (oneway=true) entries. Sometimes it is desirable to ignore those.

For example, consider a web application with the UI in a given language, with a link to another, related web app. The link should include the UI language, and the target server may also use the client's Accept-Language header data. The target server has its own list of supported languages. We may want to favor UI language consistency, that is, if there is a decent match for the original UI language, we want to use it, but not if it is merely a fallback.

Let's say the UI language is Albanian, Accept-Language is French, and the supported languages are (English, French). It seems best to look up only the UI language first, but getBestMatch(Albanian) is English due to a CLDR fallback. It seems much better to detect this fallback and try a wider match, for example by passing in the whole list of UI+Accept languages and allowing fallbacks for that; French would win.
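The Albanian/French scenario above can be sketched with a toy matcher. This is not the ICU LocaleMatcher: the distance table, the ONE_WAY set standing in for oneway=true entries, the threshold of 50, and the names bestMatch and chooseLanguage are all assumptions made up for illustration, and the sketch ignores per-desired-locale demotion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class UiLanguageLinkSketch {
    // Toy distances keyed "desired>supported"; the numbers are made up, not CLDR data.
    static final Map<String, Integer> DISTANCE = Map.of("sq>en", 30);
    // Pairs standing in for CLDR oneway="true" fallback entries (assumption).
    static final Set<String> ONE_WAY = Set.of("sq>en");
    static final int THRESHOLD = 50;  // assumed cutoff; distances must be below it
    static final int NO_MATCH = 100;

    static int distance(String desired, String supported) {
        if (desired.equals(supported)) return 0;
        return DISTANCE.getOrDefault(desired + ">" + supported, NO_MATCH);
    }

    // Returns the closest supported language, or null if nothing is below the threshold.
    static String bestMatch(List<String> desired, List<String> supported, boolean allowFallbacks) {
        String best = null;
        int bestDistance = THRESHOLD;
        for (String d : desired) {
            for (String s : supported) {
                if (!allowFallbacks && ONE_WAY.contains(d + ">" + s)) continue;
                int dist = distance(d, s);
                if (dist < bestDistance) {
                    best = s;
                    bestDistance = dist;
                }
            }
        }
        return best;
    }

    // Step 1: UI language only, fallbacks rejected. Step 2: UI + Accept list, fallbacks allowed.
    static String chooseLanguage(String ui, List<String> accept, List<String> supported) {
        String strict = bestMatch(List.of(ui), supported, false);
        if (strict != null) return strict;
        List<String> all = new ArrayList<>();
        all.add(ui);
        all.addAll(accept);
        return bestMatch(all, supported, true);
    }

    public static void main(String[] args) {
        List<String> supported = List.of("en", "fr");
        System.out.println(bestMatch(List.of("sq"), supported, true));       // en
        System.out.println(chooseLanguage("sq", List.of("fr"), supported));  // fr
    }
}
```

With fallbacks allowed, the UI-only lookup lands on English; the two-step lookup rejects that fallback and then finds the exact French match in the wider list.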

The LocaleMatcher has a match() function which returns the inverse of the distance. We could do a reverse check ("best match"→UI language) and see if that distance is also below the threshold. However, the match() function is clunky, the match/distance numbers are subject to implementation details, and we don't even expose the threshold value, so match() is rightly deprecated.

We could add a new function like isMatch(desired, supported) for a cleaner API. Problem: If we have a fallback with a small distance and a roundtrip match with a larger (but still acceptable) distance, getBestMatch() would still return the fallback, and we would discard it; we would never see the roundtrip match.

So it would be better to skip over fallback matches fairly deeply inside the getBestMatch() implementation.

Remember that class LocaleMatcher is immutable.

Easiest would be to make it a build-time option. Problem: In a use case like above, we would need two LocaleMatcher objects that differ only in this option.

I think we should make it a per-call option. We would need some getBestMatch() variants with this behavior. We could add overloads with another argument, but as long as we don't plan to add yet another behavior variation we could use methods with distinct names (e.g., "Roundtrip" infix/suffix) and the same argument lists. In Java, LocaleMatcher has eight getBest...(...) functions. I suggest we add "roundtrip" options only to the four that take locale Iterables, or even only to the two that also return a Result.


What do we mean by "roundtrip"?

Inside getBestMatch() we compute the distance for each (desired, supported) language pair. If this distance is smaller than that of the previous best match, we update the best match. With the new option, we would additionally check the distance with the desired and supported languages swapped.

Literally, "roundtrip" could mean that the reverse distance is the same as the forward distance. That seems stricter than necessary. For example, zh_Hans→zh_Hant and zh_Hant→zh_Hans have different distances.

We could check that the reverse distance is below the matcher's threshold; this is closer to checking isMatch() after getBestMatch().
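A minimal sketch of this variant, assuming a toy distance table rather than CLDR data. The languages xx, aa, bb, the numbers, the threshold of 50, and all method names are hypothetical. It also contrasts the in-loop check with calling isMatch() after getBestMatch(), which loses the roundtrip match behind a closer fallback.

```java
import java.util.List;
import java.util.Map;

class RoundtripLoopSketch {
    // Made-up distances keyed "desired>supported" for placeholder languages.
    static final Map<String, Integer> DISTANCE = Map.of(
        "xx>aa", 10,   // one-way fallback: there is no "aa>xx" entry
        "xx>bb", 40,   // larger but still acceptable, and it works both ways
        "bb>xx", 40);
    static final int THRESHOLD = 50;  // assumed cutoff
    static final int NO_MATCH = 100;

    static int distance(String desired, String supported) {
        if (desired.equals(supported)) return 0;
        return DISTANCE.getOrDefault(desired + ">" + supported, NO_MATCH);
    }

    static boolean isMatch(String desired, String supported) {
        return distance(desired, supported) < THRESHOLD;
    }

    static String getBestMatch(List<String> desired, List<String> supported, boolean roundtrip) {
        String best = null;
        int bestDistance = THRESHOLD;
        for (String d : desired) {
            for (String s : supported) {
                int forward = distance(d, s);
                if (forward >= bestDistance) continue;
                // New option: also require the swapped pair to be below the threshold.
                if (roundtrip && distance(s, d) >= THRESHOLD) continue;
                best = s;
                bestDistance = forward;
            }
        }
        return best;
    }

    // Alternative: take the plain best match, then verify it in reverse afterwards.
    static String bestThenCheck(String desired, List<String> supported) {
        String best = getBestMatch(List.of(desired), supported, false);
        return (best != null && isMatch(best, desired)) ? best : null;
    }

    public static void main(String[] args) {
        List<String> supported = List.of("aa", "bb");
        System.out.println(getBestMatch(List.of("xx"), supported, false)); // aa
        System.out.println(bestThenCheck("xx", supported));                // null
        System.out.println(getBestMatch(List.of("xx"), supported, true));  // bb
    }
}
```

Checking afterwards discards the fallback and returns nothing, while skipping the fallback inside the loop lets the larger roundtrip match win.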

We could check that the reverse distance is below the current threshold, taking into account the per-desired-locale demotion accumulated so far, that is, the same demotion that we applied to the forward distance. (Depending on the implementation, the demotion either adds to each distance or lowers the threshold.)

Right now, I am not sure which of these two to choose, but I am leaning towards the latter.

In addition, with that latter choice (reverse distance below demoted threshold), we may(?!) use the max(forward, reverse) distance (not just the forward distance) for determining if this is a new best match.
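A rough sketch of this latter choice, under several assumptions: the demotion is modeled as a fixed amount per desired locale that lowers the threshold and adds to the score, and the best-match comparison uses max(forward, reverse). The distance values (including the zh_Hans/zh_Hant numbers, where only the asymmetry comes from the text above), the constants, and the method name getBestMatchRoundtrip are hypothetical.

```java
import java.util.List;
import java.util.Map;

class DemotedRoundtripSketch {
    // Made-up distances keyed "desired>supported"; not real CLDR values.
    static final Map<String, Integer> DISTANCE = Map.of(
        "zh_Hant>zh_Hans", 20,
        "zh_Hans>zh_Hant", 30,   // asymmetric, yet both directions are acceptable
        "sq>en", 30);            // one-way fallback: no "en>sq" entry
    static final int THRESHOLD = 50;            // assumed cutoff
    static final int DEMOTION_PER_DESIRED = 5;  // assumed penalty per later desired locale
    static final int NO_MATCH = 100;

    static int distance(String desired, String supported) {
        if (desired.equals(supported)) return 0;
        return DISTANCE.getOrDefault(desired + ">" + supported, NO_MATCH);
    }

    static String getBestMatchRoundtrip(List<String> desired, List<String> supported) {
        String best = null;
        int bestScore = THRESHOLD;
        int demotion = 0;
        for (String d : desired) {
            // Demotion modeled as lowering the threshold for this desired locale.
            int demotedThreshold = THRESHOLD - demotion;
            for (String s : supported) {
                int forward = distance(d, s);
                int reverse = distance(s, d);
                // Both directions must be below the demoted threshold.
                if (forward >= demotedThreshold || reverse >= demotedThreshold) continue;
                // Rank candidates by the worse of the two directions, plus demotion.
                int score = Math.max(forward, reverse) + demotion;
                if (score < bestScore) {
                    best = s;
                    bestScore = score;
                }
            }
            demotion += DEMOTION_PER_DESIRED;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(getBestMatchRoundtrip(List.of("zh_Hant"), List.of("zh_Hans"))); // zh_Hans
        System.out.println(getBestMatchRoundtrip(List.of("sq"), List.of("en")));           // null
    }
}
```

In this sketch an asymmetric pair still matches as long as both directions stay below the demoted threshold, while a one-way fallback is skipped no matter how small its forward distance is.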

Assignee

Markus Scherer

Reporter

Markus Scherer

Components

Labels

None

Reviewer

None

Priority

medium

Time Needed

Days

Fix versions
