uloc_getName returns incorrect full name for "qps_plocm"

Description

Problem:

More specifically, the returned name contains an extra '_'.

char value[100];
UErrorCode status = U_ZERO_ERROR;
int32_t localeLength = uloc_getName("qps_plocm", value, 100, &status);
printf("%s\n", value); // outputs qps__PLOCM (two underscores rather than one)

Cause:

This happens during canonicalization: ulocimp_getCountry returns an empty string because the candidate country code ("plocm") is longer than 3 characters, which is correct. However, the caller still appends a '_' as if a country code had been appended, even though the value was empty.

https://github.com/unicode-org/icu/blob/4ab713b1c6fb604d3854e7781bee2051878d6814/icu4c/source/common/uloc.cpp#L1568

Possible Solution:

I think the condition that decides whether to append the separator should instead check whether the country is empty.

CharString country = ulocimp_getCountry(tmpLocaleID+1, &cntryID, *err);
tag.append(country, *err);
if (!country.isEmpty()) {
    /* Found optional country */
    tmpLocaleID = cntryID;
}
if(_isIDSeparator(*tmpLocaleID)) {
    /* If there is something else, then we add the _  if we found country before. */
    if (!_isIDSeparator(*(tmpLocaleID+1))) {  // -> Use !country.isEmpty() instead
        ++fieldCount;
        tag.append('_', *err);
    }
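
To make the difference concrete, here is a minimal self-contained C sketch of the two separator rules. It is a simplified model and not ICU code: join_as_reported and join_as_proposed are hypothetical helpers, and the locale is assumed to be already split into language, country, and variant fields.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an ICU function): models the reported behavior,
 * where a separator is written for the country slot whenever a variant
 * follows, even if the country itself is empty. */
static void join_as_reported(const char *lang, const char *country,
                             const char *variant, char *out, size_t cap) {
    assert(out != NULL && cap > 0);
    if (variant[0] != '\0') {
        snprintf(out, cap, "%s_%s_%s", lang, country, variant);
    } else if (country[0] != '\0') {
        snprintf(out, cap, "%s_%s", lang, country);
    } else {
        snprintf(out, cap, "%s", lang);
    }
}

/* Hypothetical helper (not an ICU function): models the proposal, where a
 * separator is written only in front of a non-empty field. */
static void join_as_proposed(const char *lang, const char *country,
                             const char *variant, char *out, size_t cap) {
    assert(out != NULL && cap > 0);
    snprintf(out, cap, "%s", lang);
    if (country[0] != '\0') {
        size_t n = strlen(out);  /* always < cap: snprintf terminates out */
        snprintf(out + n, cap - n, "_%s", country);
    }
    if (variant[0] != '\0') {
        size_t n = strlen(out);
        snprintf(out + n, cap - n, "_%s", variant);
    }
}
```

With the fields "qps", "" and "PLOCM", join_as_reported produces qps__PLOCM and join_as_proposed produces qps_PLOCM. The double underscore in the first form keeps the empty country position, which the comments in this thread describe as intended for ICU legacy locale IDs.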

Activity

Markus Scherer

October 13, 2022 at 8:56 PM

Markus Scherer mentioned “The uloc_toLanguageTag() API docs only promise ‘Returns a well-formed language tag for this locale ID.’”
To me, the double underscore doesn’t sound like a well-formed language tag. Can someone elucidate on that?

This is confusing two separate issues.

The double underscore is returned by uloc_getName(). This is as designed, because this API returns ICU legacy locale IDs. uloc_toLanguageTag() returns well-formed BCP 47 language tags.
My comment that you quoted there was about uloc_toLanguageTag() converting to BCP 47 syntax but not also canonicalizing the locale/language.

Shawn.Steele@microsoft.com

October 13, 2022 at 6:26 PM

@Markus Scherer mentioned “The uloc_toLanguageTag() API docs only promise ‘Returns a well-formed language tag for this locale ID.’”

To me, the double underscore doesn’t sound like a well-formed language tag. Can someone elucidate on that?

The proposed ‘fix’ for this bug that would resolve Tarek’s problem is to ‘merely’ not emit an extra underscore when information is missing. That seems like a fairly small test that should not incur much of a perf penalty. Full canonicalization is not required.

Markus Scherer

September 15, 2022 at 9:33 PM

First canonicalize() and then toLanguageTag() should work.

Tarek Ghonaim

September 15, 2022 at 9:28 PM

Also, if I cannot use uloc_toLanguageTag instead of uloc_canonicalize, there is still a problem with uloc_canonicalize, which returns names with two underscores. What would be the alternative here?

Tarek Ghonaim

September 15, 2022 at 9:24 PM

Interesting. I tried it and it returned a canonicalized language tag, no?

    localeLength = uloc_toLanguageTag("EN_us", value, 100, TRUE, &status);
    printf("%s ... %d\n", value, localeLength);

returns:

en-US ... 5

Working as Designed

Details

Assignee

Unassigned

Reporter

Rahul Pandey

Components

locale_id

Priority

assess

Created September 2, 2022 at 1:23 PM

Updated October 13, 2022 at 8:56 PM

Resolved September 15, 2022 at 4:46 PM