uloc_getName returns incorrect full name for "qps_plocm"
Description
Activity
Markus Scherer mentioned: “The uloc_toLanguageTag() API docs only promise ‘Returns a well-formed language tag for this locale ID.’”
To me, the double underscore doesn’t look like a well-formed language tag. Can someone clarify that?
This is confusing two separate issues.
The double underscore is returned by uloc_getName(). This is as designed, because this API returns ICU legacy locale IDs. uloc_toLanguageTag() returns well-formed BCP 47 language tags.
My comment that you quoted there was about uloc_toLanguageTag() converting to BCP 47 syntax but not also canonicalizing the locale/language.
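To make the distinction concrete, here is a rough sketch of a shape check for BCP 47 tags. It is deliberately simplified and is not ICU code: the function name looks_like_bcp47 is hypothetical, and the check only enforces that subtags are separated by hyphens and consist of 1 to 8 ASCII letters or digits. The full BCP 47 grammar has more rules than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of ICU. ASCII letters and digits only. */
static bool is_ascii_alnum(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* Simplified shape check, NOT the full BCP 47 grammar:
 * hyphen-separated subtags, each 1 to 8 ASCII letters or digits. */
bool looks_like_bcp47(const char *tag) {
    assert(tag != NULL);
    size_t subtag_len = 0;
    for (const char *p = tag;; ++p) {
        if (*p == '-' || *p == '\0') {
            /* A subtag just ended: it must not be empty or too long. */
            if (subtag_len == 0 || subtag_len > 8) {
                return false;
            }
            if (*p == '\0') {
                return true;
            }
            subtag_len = 0;
        } else if (is_ascii_alnum(*p)) {
            ++subtag_len;
        } else {
            /* Underscores and any other character are rejected. */
            return false;
        }
    }
}
```

Under this check, a string such as qps__PLOCM is rejected because of the underscores, which is consistent with it being an ICU legacy locale ID rather than a language tag, while en-US passes.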
The proposed ‘fix’ for this bug that would resolve Tarek’s problem is to ‘merely’ not emit an extra underscore when information is missing. That seems like a fairly small test that should not incur much of a perf penalty. Full canonicalization is not required.
First canonicalize() and then toLanguageTag() should work.
Also, if I cannot use uloc_toLanguageTag instead of uloc_canonicalize, then there is still a problem with uloc_canonicalize, which returns names with 2 underscores. What would be the alternative here?
Interesting. I tried it and it returned a canonicalized language tag, no?
localeLength = uloc_toLanguageTag("EN_us", value, 100, TRUE, &status);
printf("%s ... %d\n", value, localeLength);
returns:
en-US ... 5
Problem:
More specifically, uloc_getName adds an extra '_' to the result.
int32_t localeLength = uloc_getName("qps_plocm", value, 100, &status);
printf("%s\n", value); // outputs qps__PLOCM (two underscores rather than one)
Cause:
This happens because, while canonicalizing, ulocimp_getCountry returns an empty string, since the candidate country code is longer than 3 characters (rightly so). We then still append an extra '_' as if a country code had been appended, even though the value was empty.
https://github.com/unicode-org/icu/blob/4ab713b1c6fb604d3854e7781bee2051878d6814/icu4c/source/common/uloc.cpp#L1568
Possible Solution:
I think the check that decides whether to append the separator should instead test whether the country is non-empty.
CharString country = ulocimp_getCountry(tmpLocaleID+1, &cntryID, *err);
tag.append(country, *err);
if (!country.isEmpty()) {
    /* Found optional country */
    tmpLocaleID = cntryID;
}
if(_isIDSeparator(*tmpLocaleID)) {
    /* If there is something else, then we add the _ if we found country before. */
    if (!_isIDSeparator(*(tmpLocaleID+1))) { // -> Use !country.isEmpty() instead
        ++fieldCount;
        tag.append('_', *err);
    }
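The difference between the two checks can be seen in a small toy model. This is only a sketch and not ICU's implementation: toy_get_name, append_char, and the use_proposed_check flag are all hypothetical names, scripts and keywords are ignored, and the model only handles IDs of the form language, language_REGION, language_variant, or language_REGION_variant. A candidate country is accepted only when it is 2 or 3 characters long, mirroring the behavior described under Cause.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: bounded append that keeps the buffer NUL-terminated. */
static void append_char(char *out, size_t cap, size_t *len, char c) {
    if (*len + 1 < cap) {
        out[(*len)++] = c;
        out[*len] = '\0';
    }
}

/* Toy model only, NOT ICU code. Builds a legacy-style name from id.
 * use_proposed_check = false: separator is added when the next character
 *                             is not another separator (current check).
 * use_proposed_check = true:  separator is added only when a country was
 *                             actually found (proposed check). */
void toy_get_name(const char *id, bool use_proposed_check,
                  char *out, size_t cap) {
    assert(id != NULL && out != NULL && cap > 0);
    size_t len = 0;
    out[0] = '\0';
    const char *p = id;

    /* Language: everything up to the first separator, lowercased. */
    while (*p != '\0' && *p != '_') {
        append_char(out, cap, &len, (char)tolower((unsigned char)*p));
        ++p;
    }
    if (*p != '_') {
        return;
    }
    /* Separator that follows the language. */
    append_char(out, cap, &len, '_');

    /* Country: accepted only if the next field has 2 or 3 characters. */
    const char *rest = p + 1;
    size_t n = strcspn(rest, "_");
    bool has_country = (n == 2 || n == 3);
    if (has_country) {
        for (size_t i = 0; i < n; ++i) {
            append_char(out, cap, &len, (char)toupper((unsigned char)rest[i]));
        }
        p = rest + n;
    }

    /* Something else follows: decide on the separator, then copy the
     * remainder uppercased as the variant. */
    if (*p == '_') {
        bool add_separator = use_proposed_check ? has_country : (p[1] != '_');
        if (add_separator) {
            append_char(out, cap, &len, '_');
        }
        for (const char *v = p + 1; *v != '\0'; ++v) {
            append_char(out, cap, &len, (char)toupper((unsigned char)*v));
        }
    }
}
```

In this model, "qps_plocm" yields qps__PLOCM with the current check and qps_PLOCM with the proposed one, while an ID with a real country such as "en_us_posix" yields en_US_POSIX either way. Whether the single-underscore form is desirable for real ICU legacy locale IDs is exactly what is being debated in the comments above.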