
uloc_forLanguageTag is too permissive; no error is indicated for structurally invalid BCP 47 tags such as hant-cmn-cn or zh-x_t-ab


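hant-cmn-cn is not a valid BCP 47 language tag: a 4-letter primary language subtag is reserved, and 'cmn' cannot follow it as a script, region, variant, extension, or private-use subtag.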

However, uloc_forLanguageTag gives no indication of an error: 'parsed_length' is the same as the input length (meaning the input was fully parsed) and U_FAILURE(status) is false. (For this example, I don't have to worry about the buffer being filled up without a terminating null.)

#include <string>
#include "unicode/uloc.h"

std::string canonicalizeLoc(const std::string& locale_id, std::string* icu_loc) {
  const char* const kInvalidTag = "invalid-tag";
  UErrorCode error = U_ZERO_ERROR;
  char icu_result[ULOC_FULLNAME_CAPACITY];
  int parsed_length = 0;
  size_t input_len = locale_id.length();
  // Convert the BCP 47 tag to an ICU locale ID; parsed_length receives the
  // number of input characters that were consumed.
  uloc_forLanguageTag(locale_id.data(), icu_result, ULOC_FULLNAME_CAPACITY,
                      &parsed_length, &error);
  // Expected to catch invalid tags, but not triggered for the inputs below.
  if (U_FAILURE(error) || parsed_length < input_len) {
    *icu_loc = std::string("invalid_icu_ocale");
    return std::string(kInvalidTag);
  }
  //*icu_loc = std::string(icu_result, icu_length);
  *icu_loc = std::string(icu_result);

  char result[ULOC_FULLNAME_CAPACITY];

  // Force strict BCP47 rules.
  uloc_toLanguageTag(icu_result, result, ULOC_FULLNAME_CAPACITY, TRUE, &error);

  if (U_FAILURE(error)) {
    return std::string(kInvalidTag);
  }

  return std::string(result);
}

input: hant-cmn-cn
after uloc_forLanguageTag : cmn_CN
after uloc_toLanguageTag : cmn-CN

Another example is zh-x_t-ab. It should be rejected (parsed_length < input_length), but it's not. This case requires another 1-line fix (subtagLength check for 'x').
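To illustrate the kind of subtag length check meant here, the following is a minimal standalone sketch of a structural check for a private-use sequence, following the RFC 5646 rule privateuse = "x" 1*("-" (1*8alphanum)). The helper name isPrivateUseSequence is hypothetical and this is not ICU's code or the actual fix; it only shows what the rule requires.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ICU): returns true if `s` is a well-formed
// private-use sequence per RFC 5646: "x" 1*("-" (1*8alphanum)).
bool isPrivateUseSequence(const std::string& s) {
  // Must start with "x-" (case-insensitive) and have at least one subtag.
  if (s.size() < 3 || (s[0] != 'x' && s[0] != 'X') || s[1] != '-') {
    return false;
  }
  std::size_t subtag_len = 0;
  for (std::size_t i = 2; i < s.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if (c == '-') {
      if (subtag_len == 0) return false;  // empty subtag, e.g. "x--a"
      subtag_len = 0;
    } else if (c < 0x80 && std::isalnum(c)) {
      if (++subtag_len > 8) return false;  // subtag longer than 8 characters
    } else {
      return false;  // any other character, e.g. '_'
    }
  }
  return subtag_len > 0;  // reject a trailing '-'
}
```

Applied to the tail of the tag above, "x_t-ab" is rejected because 'x' is not followed by '-', whereas "x-ab" is accepted.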

input: zh-x_t-ab
after uloc_forLanguageTag: zh@x=ab
after uloc_toLanguageTag: zh-x-ab

This issue blocks V8 from using uloc_forLanguageTag and checking 'parsed_length' to see if a language tag is valid per BCP 47. V8 ends up using a regex to check BCP 47 structural validity.
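For reference, a regex-based structural check of this kind could look roughly like the sketch below. It is not V8's actual regex. It is a simplified transcription of the RFC 5646 'langtag' and 'privateuse' productions, and it assumes grandfathered tags and the duplicate-variant and duplicate-singleton rules can be ignored. The function name isStructurallyValidLangtag is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: structural (well-formedness) check only. It does not
// consult the IANA subtag registry and does not handle grandfathered tags.
bool isStructurallyValidLangtag(const std::string& tag) {
  static const std::regex kLangtag(
      // language: 2-3 letters plus up to 3 extlangs, or 4 letters (reserved),
      // or 5-8 letters
      "(?:[A-Za-z]{2,3}(?:-[A-Za-z]{3}){0,3}|[A-Za-z]{4}|[A-Za-z]{5,8})"
      "(?:-[A-Za-z]{4})?"                               // script
      "(?:-(?:[A-Za-z]{2}|[0-9]{3}))?"                  // region
      "(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*"  // variants
      "(?:-[0-9A-WY-Za-wy-z](?:-[A-Za-z0-9]{2,8})+)*"   // extensions
      "(?:-[xX](?:-[A-Za-z0-9]{1,8})+)?"                // private use
      // or a tag consisting only of a private-use sequence
      "|[xX](?:-[A-Za-z0-9]{1,8})+");
  // regex_match succeeds only if the whole string matches the pattern.
  return std::regex_match(tag, kLangtag);
}
```

With this pattern, hant-cmn-cn and zh-x_t-ab are both rejected, while cmn-CN and zh-x-ab are accepted.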




Jungshik Shin

