uloc_forLanguageTag is too permissive; no error is indicated for structurally invalid BCP 47 tags such as hant-cmn-cn or zh-x_t-ab

Description

hant-cmn-cn is not a valid BCP 47 language tag.

However, uloc_forLanguageTag gives no indication of an error: 'parsed_length' equals the input length (meaning the input was fully parsed) and U_FAILURE(status) is false. (For this example, I don't have to worry about the buffer being filled up without a terminating null.)

input: hant-cmn-cn
after uloc_forLanguageTag : cmn_CN
after uloc_toLanguageTag : cmn-CN

Another example is zh-x_t-ab. It should be rejected (parsed_length < input_length), but it's not. This case requires another 1-line fix (subtagLength check for 'x').

This issue blocks v8 from using uloc_forLanguageTag and checking 'parsed_length' to determine whether a language tag is valid per BCP 47. v8 ends up using a regex to check BCP 47 structural validity.
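
For reference, here is a minimal sketch of the kind of structural check a caller would want to get from 'parsed_length'. It is not ICU's code and not v8's regex. The function name is_well_formed_langtag and the helpers are hypothetical, and the sketch assumes the RFC 5646 'langtag' and 'privateuse' productions only (grandfathered tags are not handled) and the default "C" locale for the ctype functions.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SUBTAGS 32  /* arbitrary limit chosen for this sketch */

typedef struct { const char *s; size_t len; } Subtag;

static bool all_alpha(Subtag t) {
    for (size_t i = 0; i < t.len; i++)
        if (!isalpha((unsigned char)t.s[i])) return false;
    return true;
}

static bool all_digit(Subtag t) {
    for (size_t i = 0; i < t.len; i++)
        if (!isdigit((unsigned char)t.s[i])) return false;
    return true;
}

/* True only for a one-character subtag equal to c (case-insensitive). */
static bool is_char(Subtag t, char c) {
    return t.len == 1 && tolower((unsigned char)t.s[0]) == c;
}

/* variant = 5*8alphanum / (DIGIT 3alphanum); alphanum-ness is checked by the tokenizer. */
static bool is_variant(Subtag t) {
    return (t.len >= 5 && t.len <= 8) ||
           (t.len == 4 && isdigit((unsigned char)t.s[0]));
}

/* Hypothetical helper: structural (well-formedness) check only, no registry lookup. */
bool is_well_formed_langtag(const char *tag) {
    Subtag t[MAX_SUBTAGS];
    size_t n = 0;
    const char *p = tag;

    /* Split on '-'; every subtag must be 1..8 ASCII alphanumerics, so '_' is rejected here. */
    for (;;) {
        size_t len = 0;
        while (isalnum((unsigned char)p[len])) len++;
        if (len == 0 || len > 8 || n == MAX_SUBTAGS) return false;
        if (p[len] != '-' && p[len] != '\0') return false;
        t[n].s = p; t[n].len = len; n++;
        if (p[len] == '\0') break;
        p += len + 1;
    }

    size_t i = 0;
    if (!is_char(t[0], 'x')) {
        /* language = 2*3ALPHA ["-" extlang] / 4ALPHA / 5*8ALPHA */
        if (t[0].len < 2 || !all_alpha(t[0])) return false;
        i = 1;
        if (t[0].len <= 3)  /* up to three 3ALPHA extlang subtags */
            for (int k = 0; k < 3 && i < n && t[i].len == 3 && all_alpha(t[i]); k++) i++;
        if (i < n && t[i].len == 4 && all_alpha(t[i])) i++;          /* script */
        if (i < n && ((t[i].len == 2 && all_alpha(t[i])) ||
                      (t[i].len == 3 && all_digit(t[i])))) i++;      /* region */
        while (i < n && is_variant(t[i])) i++;                       /* variants */
        /* extension = singleton 1*("-" 2*8alphanum), singleton is anything but 'x' */
        while (i < n && t[i].len == 1 && !is_char(t[i], 'x')) {
            size_t start = ++i;
            while (i < n && t[i].len >= 2) i++;
            if (i == start) return false;
        }
    }
    /* privateuse = "x" 1*("-" 1*8alphanum); the length check on 'x' is in is_char */
    if (i < n && is_char(t[i], 'x')) {
        if (i + 1 >= n) return false;
        i = n;
    }
    return i == n;  /* any unconsumed subtag means the tag is malformed */
}
```

With this sketch, hant-cmn-cn is rejected because the 3-letter subtag cmn cannot follow a 4-letter primary subtag, and zh-x_t-ab is rejected because '_' is not a legal subtag character. cmn-CN and zh-x-t-ab are both accepted.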

Assignee

Jungshik Shin

Reporter

Jungshik Shin

Components

Labels

Reviewer

None

Priority

medium

Time Needed

None

Fix versions
