uloc_forLanguageTag is too permissive; no error is indicated for structurally invalid BCP 47 tags such as hant-cmn-cn or zh-x_t-ab

Description

hant-cmn-cn is not a valid BCP 47 language tag.

However, uloc_forLanguageTag gives no indication of an error: 'parsed_length' is the same as the input length (meaning the input was fully parsed) and U_FAILURE(status) is false. (For this example, I don't have to worry about the buffer being filled up without a terminating null.)

#include <string>
#include "unicode/uloc.h"

std::string canonicalizeLoc(const std::string& locale_id, std::string* icu_loc) {
  const char* const kInvalidTag = "invalid-tag";
  UErrorCode error = U_ZERO_ERROR;
  char icu_result[ULOC_FULLNAME_CAPACITY];
  int32_t parsed_length = 0;
  // Cast once so the comparison below is not a signed/unsigned mix.
  const int32_t input_len = static_cast<int32_t>(locale_id.length());

  // c_str() guarantees the NUL terminator that uloc_forLanguageTag expects.
  uloc_forLanguageTag(locale_id.c_str(), icu_result, ULOC_FULLNAME_CAPACITY,
                      &parsed_length, &error);
  // A tag that is not consumed in full should be treated as invalid.
  if (U_FAILURE(error) || parsed_length < input_len) {
    *icu_loc = std::string("invalid_icu_ocale");
    return std::string(kInvalidTag);
  }
  //*icu_loc = std::string(icu_result, icu_length);
  *icu_loc = std::string(icu_result);

  char result[ULOC_FULLNAME_CAPACITY];
  // Force strict BCP47 rules.
  uloc_toLanguageTag(icu_result, result, ULOC_FULLNAME_CAPACITY, TRUE, &error);
  if (U_FAILURE(error)) {
    return std::string(kInvalidTag);
  }
  return std::string(result);
}

input: hant-cmn-cn
after uloc_forLanguageTag : cmn_CN
after uloc_toLanguageTag : cmn-CN

Another example is zh-x_t-ab. It should be rejected (parsed_length < input_length), but it's not. This case requires another 1-line fix (subtagLength check for 'x').
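To illustrate the kind of subtag-length check meant here, below is a minimal sketch. It is not ICU's code or the actual patch: splitOnHyphen and isPrivateUseSingleton are hypothetical helpers, and the sketch assumes that only '-' separates subtags, so "x_t" stays one 3-character subtag and is not accepted as the private-use singleton 'x'.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split a tag on '-' only. '_' is deliberately not
// treated as a separator, so "zh-x_t-ab" yields {"zh", "x_t", "ab"}.
std::vector<std::string> splitOnHyphen(const std::string& tag) {
  std::vector<std::string> subtags;
  std::string::size_type start = 0;
  while (true) {
    const std::string::size_type end = tag.find('-', start);
    if (end == std::string::npos) {
      subtags.push_back(tag.substr(start));
      break;
    }
    subtags.push_back(tag.substr(start, end - start));
    start = end + 1;
  }
  return subtags;
}

// Hypothetical helper: a subtag introduces a private-use sequence only if
// it is exactly one character long and that character is 'x' or 'X'.
bool isPrivateUseSingleton(const std::string& subtag) {
  return subtag.size() == 1 && (subtag[0] == 'x' || subtag[0] == 'X');
}
```

With these helpers, the second subtag of zh-x_t-ab is "x_t", which fails the length check, so a parser built this way would stop after "zh" and report parsed_length < input_length.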

input: zh-x_t-ab
after uloc_forLanguageTag: zh@x=ab
after uloc_toLanguageTag: zh-x-ab

This issue blocks V8 from using uloc_forLanguageTag and checking 'parsed_length' to see if a language tag is valid per BCP 47. V8 ends up using a regex to check BCP 47 structural validity.
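For reference, a regex-based structural check of this kind could look roughly like the sketch below. It is not V8's actual regex: isStructurallyValidLanguageTag is a hypothetical name, and the pattern covers only the 'langtag' and 'privateuse' productions of the BCP 47 (RFC 5646) ABNF. It omits grandfathered tags and does not detect duplicate variants or duplicate extension singletons.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if `tag` matches the BCP 47 'langtag'
// or 'privateuse' grammar. Grandfathered tags are intentionally not handled.
bool isStructurallyValidLanguageTag(const std::string& tag) {
  static const std::regex kLangTag = [] {
    const std::string alpha = "[A-Za-z]";
    const std::string alnum = "[A-Za-z0-9]";
    // language: 2-3 letters with up to three extlangs, or 4, or 5-8 letters.
    const std::string language = "(?:" + alpha + "{2,3}(?:-" + alpha +
                                 "{3}){0,3}|" + alpha + "{4}|" + alpha +
                                 "{5,8})";
    const std::string script = "(?:-" + alpha + "{4})?";
    const std::string region = "(?:-(?:" + alpha + "{2}|[0-9]{3}))?";
    const std::string variant =
        "(?:-(?:" + alnum + "{5,8}|[0-9]" + alnum + "{3}))*";
    // extension: any singleton except 'x', followed by 2-8 char subtags.
    const std::string extension =
        "(?:-[0-9A-WY-Za-wy-z](?:-" + alnum + "{2,8})+)*";
    // privateuse: 'x' followed by one or more 1-8 char subtags.
    const std::string privateuse = "[xX](?:-" + alnum + "{1,8})+";
    return std::regex("(?:" + language + script + region + variant +
                      extension + "(?:-" + privateuse + ")?|" + privateuse +
                      ")");
  }();
  // regex_match requires the whole string to match, not just a prefix.
  return std::regex_match(tag, kLangTag);
}
```

Under this sketch, both tags from this report are rejected: in hant-cmn-cn the 3-letter subtag "cmn" cannot follow a 4-letter primary language subtag, and in zh-x_t-ab the '_' is not a legal separator. The round-tripped forms cmn-CN and zh-x-ab are accepted.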

Status

Assignee

Jungshik Shin

Reporter

Jungshik Shin

Labels

Reviewer

None

Time Needed

None

Start date

None

Components

Fix versions

Priority

medium