uloc_getName returns incorrect full name for "qps_plocm"

Description

Problem:

More specifically, it adds an extra '_' while returning the result.

char value[100];
UErrorCode status = U_ZERO_ERROR;
int32_t localeLength = uloc_getName("qps_plocm", value, 100, &status);
printf("%s\n", value); // outputs qps__PLOCM (two underscores '_' rather than one)

 

Cause:

This happens during canonicalization: ulocimp_getCountry returns an empty string because the candidate country subtag ("plocm") is longer than 3 characters, which is correct. The code then appends an extra '_' anyway, as if a country code had in fact been appended, even though the value was empty.

https://github.com/unicode-org/icu/blob/4ab713b1c6fb604d3854e7781bee2051878d6814/icu4c/source/common/uloc.cpp#L1568

 

Possible Solution:

I think the check that decides whether to append the '_' should test whether the country is non-empty, instead of testing whether the next character is a separator.

CharString country = ulocimp_getCountry(tmpLocaleID+1, &cntryID, *err);
tag.append(country, *err);
if (!country.isEmpty()) {
    /* Found optional country */
    tmpLocaleID = cntryID;
}
if(_isIDSeparator(*tmpLocaleID)) {
    /* If there is something else, then we add the _ if we found country before. */
    if (!_isIDSeparator(*(tmpLocaleID+1))) { // -> Use !country.isEmpty() instead
        ++fieldCount;
        tag.append('_', *err);
    }
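
To make the difference between the two checks concrete, here is a small standalone model in C. It is only a sketch and not ICU's code: model_get_name and its helpers are hypothetical names, script subtags and keywords are ignored, and everything after the country is treated as a single variant. It assumes, as described above, that a country is accepted only when the subtag is 2 or 3 characters long.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: ICU-style IDs accept '_' or '-' as separators. */
static int is_sep(char c) { return c == '_' || c == '-'; }

/* Length of the subtag starting at s (up to the next separator or NUL). */
static size_t subtag_len(const char *s) {
    size_t n = 0;
    while (s[n] != '\0' && !is_sep(s[n])) n++;
    return n;
}

/* Append n chars of s to out, changing case; truncates if out is full. */
static void append_cased(char *out, size_t cap, const char *s, size_t n, int upper) {
    size_t len = strlen(out);
    for (size_t i = 0; i < n && len + 1 < cap; i++) {
        unsigned char c = (unsigned char)s[i];
        out[len++] = (char)(upper ? toupper(c) : tolower(c));
    }
    out[len] = '\0';
}

/*
 * Simplified model of the name-building step (assumption: no script, no keywords).
 * use_proposed_check == 0: append '_' when the next char is not a separator.
 * use_proposed_check != 0: append '_' only when a country was found.
 */
void model_get_name(const char *id, int use_proposed_check, char *out, size_t cap) {
    if (cap == 0) return;
    out[0] = '\0';

    const char *p = id;
    size_t n = subtag_len(p);
    append_cased(out, cap, p, n, 0);          /* language, lowercased */
    p += n;
    if (!is_sep(*p)) return;
    append_cased(out, cap, "_", 1, 0);        /* separator after the language */

    size_t clen = subtag_len(p + 1);
    int have_country = (clen == 2 || clen == 3);
    if (have_country) {                       /* found optional country */
        append_cased(out, cap, p + 1, clen, 1);
        p += 1 + clen;
    }
    if (!is_sep(*p)) return;

    int add_sep = use_proposed_check ? have_country : !is_sep(p[1]);
    if (add_sep) append_cased(out, cap, "_", 1, 0);

    append_cased(out, cap, p + 1, strlen(p + 1), 1);  /* variant, uppercased */
}
```

In this model, "qps_plocm" becomes qps__PLOCM with the current check and qps_PLOCM with the proposed one, while "en_US_POSIX" yields en_US_POSIX either way.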

Activity

Markus Scherer
October 13, 2022 at 8:56 PM

Markus Scherer mentioned “The uloc_toLanguageTag() API docs only promise ‘Returns a well-formed language tag for this locale ID.’”

To me, the double underscore doesn’t sound like a well-formed language tag. Can someone elucidate on that?

This is confusing two separate issues.

  • The double underscore is returned by uloc_getName(). This is as designed, because this API returns ICU legacy locale IDs. uloc_toLanguageTag() returns well-formed BCP 47 language tags.

  • My comment that you quoted there was about uloc_toLanguageTag() converting to BCP 47 syntax but not also canonicalizing the locale/language.

Shawn.Steele@microsoft.com
October 13, 2022 at 6:26 PM

Markus Scherer mentioned “The uloc_toLanguageTag() API docs only promise ‘Returns a well-formed language tag for this locale ID.’”

To me, the double underscore doesn’t sound like a well-formed language tag. Can someone elucidate on that?

The proposed ‘fix’ for this bug that would resolve Tarek’s problem is to ‘merely’ not emit an extra underscore when information is missing. That seems like a fairly small test that should not incur much of a perf penalty. Full canonicalization is not required.

Markus Scherer
September 15, 2022 at 9:33 PM

First canonicalize() and then toLanguageTag() should work.

Tarek Ghonaim
September 15, 2022 at 9:28 PM

Also, if I cannot use uloc_toLanguageTag instead of uloc_canonicalize then there is still a problem with uloc_canonicalize which returns names with 2 underscores. What would be the alternative here?

Tarek Ghonaim
September 15, 2022 at 9:24 PM

Interesting. I tried it and it returned a canonicalized language tag, no?

localeLength = uloc_toLanguageTag("EN_us", value, 100, TRUE, &status);
printf("%s ... %d\n", value, localeLength);

returns:

en-US ... 5

Working as Designed

Details

Created September 2, 2022 at 1:23 PM
Updated October 13, 2022 at 8:56 PM
Resolved September 15, 2022 at 4:46 PM