Locale::forLanguageTag() drops the other values in -x when "lvariant" is present.

Description

UErrorCode error = U_ZERO_ERROR;
Locale l = Locale::forLanguageTag("en-US-x-test-lvariant-var", error);
l.getName() returns "en_US_VAR".

It should return "en_US_VAR@x=test" instead.

It appears that when "lvariant" is present in the -x section, forLanguageTag ignores the other values in that section.

Activity

Frank Yung-Fong Tang
February 6, 2019, 7:41 PM
Edited

Currently the Java test expects this:

{"B", "en-US-x-test-lvariant-var", "T", "en-US-x-test-lvariant-var", "en_US_VAR@x=test"},

https://github.com/unicode-org/icu/blob/778d0a6d1d46faa724ead19613bda84621794b72/icu4j/main/tests/core/src/com/ibm/icu/dev/test/util/LocaleBuilderTest.java

So either the current C++ implementation is wrong, or the current Java tests (and implementation) are wrong about the outcome.

Frank Yung-Fong Tang
February 6, 2019, 7:43 PM

To be clear: when I stated in my original report that it should return "en_US_VAR@x=test" instead, I was assuming that the Java tests and implementation are correct and treating them as the ground truth.

Yoshito Umaoka
February 21, 2019, 3:39 AM

The ICU4J behavior matches the original design. The JDK API doc for Locale#forLanguageTag explains the expected behavior, and this is also the original spec for the corresponding ICU4J API.

The portion of a private use subtag prefixed by "lvariant", if any, is removed and appended to the variant field in the result locale (without case normalization). If it is then empty, the private use subtag is discarded:
Locale loc;
loc = Locale.forLanguageTag("en-US-x-lvariant-POSIX");
loc.getVariant(); // returns "POSIX"
loc.getExtension('x'); // returns null

loc = Locale.forLanguageTag("de-POSIX-x-URP-lvariant-Abc-Def");
loc.getVariant(); // returns "POSIX_Abc_Def"
loc.getExtension('x'); // returns "urp"
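
For anyone who wants to check the reference behavior quoted above, here is a minimal sketch that runs the same tags through the JDK's java.util.Locale (not ICU4J's ULocale). The class name LvariantJdkCheck and the helper describe are hypothetical names used only for illustration.

```java
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class LvariantJdkCheck {
    // Returns "<variant>|<private-use extension>" for a BCP 47 tag,
    // as parsed by the JDK's Locale.forLanguageTag. A missing
    // extension shows up as the text "null".
    static String describe(String languageTag) {
        Locale loc = Locale.forLanguageTag(languageTag);
        return loc.getVariant() + "|" + loc.getExtension('x');
    }

    public static void main(String[] args) {
        // The two tags from the JDK documentation, then the tag from this report.
        System.out.println(describe("en-US-x-lvariant-POSIX"));
        System.out.println(describe("de-POSIX-x-URP-lvariant-Abc-Def"));
        System.out.println(describe("en-US-x-test-lvariant-var"));
    }
}
```

Per the documentation quoted above, the first two lines should show POSIX with no private-use extension, and POSIX_Abc_Def with urp. The third line shows how the JDK treats the tag from this report.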

Markus Scherer
May 28, 2019, 11:17 PM

The bug report says that the part between the 'x' and the "lvariant" is not preserved. Should it be preserved?

Yoshito Umaoka
May 29, 2019, 1:51 PM

The API reference doc might not be clear about this case. But the design intent and the current ICU4J implementation are to interpret the private-use subtags that follow “lvariant” as the variant, and the subtags between “x” and “lvariant” as the private-use keyword.

For example:

  • en-x-abc → en@x=abc

  • en-x-lvariant-variant1 → en_VARIANT1

  • en-x-abc-lvariant-variant1 → en_VARIANT1@x=abc

  • en-x-abc-def-lvariant-variant1-variant2 → en_VARIANT1_VARIANT2@x=abc-def

There is one edge case: “lvariant” followed by no subtags. In this case, “lvariant” itself is interpreted as part of the private-use value.

For example:

  • en-x-lvariant → en@x=lvariant

  • en-x-abc-lvariant → en@x=abc-lvariant
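
The splitting rule described in this comment, including the edge case, can be sketched roughly as follows. This is an illustration under stated assumptions, not the ICU4J or ICU4C source: the class LvariantSplit and its split method are hypothetical, the input is assumed to be only the subtags after "x-", the case-insensitive match on "lvariant" is an assumption, and case normalization of the variant (shown in upper case in the examples above) is left out.

```java
import java.util.Arrays;

// Hypothetical sketch; not the ICU4J or ICU4C source.
final class LvariantSplit {
    // Input: the subtags after "x-", e.g. "abc-lvariant-variant1".
    // Output: { privateUsePart, variantPart }; either may be empty.
    static String[] split(String privateUse) {
        String[] subtags = privateUse.split("-");
        int marker = -1;
        for (int i = 0; i < subtags.length; i++) {
            // Assumption: the marker is matched case-insensitively.
            if (subtags[i].equalsIgnoreCase("lvariant")) {
                marker = i;
                break;
            }
        }
        // No marker, or a marker with nothing after it: everything
        // stays in the private-use value (the edge case above).
        if (marker < 0 || marker == subtags.length - 1) {
            return new String[] { privateUse, "" };
        }
        // Subtags before the marker stay private use, joined with "-".
        String kept = String.join("-", Arrays.copyOfRange(subtags, 0, marker));
        // Subtags after the marker become the variant, joined with "_".
        String variant = String.join("_",
                Arrays.copyOfRange(subtags, marker + 1, subtags.length));
        return new String[] { kept, variant };
    }

    public static void main(String[] args) {
        String[] inputs = {
            "abc", "lvariant-variant1", "abc-lvariant-variant1",
            "abc-def-lvariant-variant1-variant2", "lvariant", "abc-lvariant"
        };
        for (String s : inputs) {
            String[] r = split(s);
            System.out.println(s + " -> x=" + r[0] + ", variant=" + r[1]);
        }
    }
}
```

The point relevant to this ticket is the first part of the result: whatever precedes "lvariant" is kept as the private-use value rather than discarded.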

 

Assignee

Yoshito Umaoka

Reporter

Frank Yung-Fong Tang

Components

Reviewer

None

Priority

assess

Time Needed

None

Fix versions
