ICU Locale canonical form not following LDML spec

Description

toLanguageTag() is defined as following the BCP47 spec. As far as I can tell, BCP47 does not specify everything that the Unicode LDML specification (http://www.unicode.org/reports/tr35/) does. However, some requirements in the latter are compatible with BCP47, and I think they should be implemented when canonicalizing language tags:

1) Sorting of variants: "en-scouse-fonipa" -> "en-fonipa-scouse"

2) Dropping of "true" in u extensions:
"und-u-foo-bar-nu-thai-ca-buddhist-kk-true" -> "und-u-bar-foo-ca-buddhist-kk-nu-thai"

Both changes are required for toLanguageTag() to conform to the LDML specification on this point.
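To make the two requested rules concrete, here is a minimal Python sketch. It is not ICU's implementation: canonicalize_tag is a hypothetical helper, and it assumes the input is already well-formed and lowercase, has no extlang or private-use part, and carries no extension other than -u-. It sorts variants, sorts -u- attributes and keywords, and drops a keyword value that is exactly "true".

```python
def canonicalize_tag(tag: str) -> str:
    """Sort variants and normalize the -u- extension (illustrative sketch only)."""
    subtags = tag.split("-")

    # Split off the -u- extension, if present (assumed to be the only extension).
    if "u" in subtags[1:]:
        cut = subtags.index("u", 1)
        head, ext = subtags[:cut], subtags[cut + 1:]
    else:
        head, ext = subtags, []

    def is_variant(s: str) -> bool:
        # Variants are 5-8 characters, or 4 characters starting with a digit.
        return 5 <= len(s) <= 8 or (len(s) == 4 and s[0].isdigit())

    language, rest = head[0], head[1:]
    fixed = [s for s in rest if not is_variant(s)]      # script and/or region
    variants = sorted(s for s in rest if is_variant(s))

    # Attributes come first; each two-character subtag then starts a keyword.
    attributes, keywords, key = [], {}, None
    for s in ext:
        if len(s) == 2:
            key = s
            keywords[key] = []
        elif key is None:
            attributes.append(s)
        else:
            keywords[key].append(s)

    out = [language] + fixed + variants
    if attributes or keywords:
        out.append("u")
        out += sorted(attributes)
        for k in sorted(keywords):
            out.append(k)
            if keywords[k] != ["true"]:  # a lone "true" value is dropped
                out += keywords[k]
    return "-".join(out)


print(canonicalize_tag("en-scouse-fonipa"))
# en-fonipa-scouse
print(canonicalize_tag("und-u-foo-bar-nu-thai-ca-buddhist-kk-true"))
# und-u-bar-foo-ca-buddhist-kk-nu-thai
```

Case canonicalization (title-case script, upper-case region) is deliberately left out to keep the sketch focused on the two points above.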

I also observed other differences between Unicode LDML locale identifiers and the BCP47-specific implementation of forLanguageTag/toLanguageTag. Some might be "working as intended"; others might require either a specification fix or a change in code behaviour:

  • toLanguageTag can return "root", which is not a valid language subtag. (It's a special case, which in the LDML spec gets replaced with "und" when producing canonical Unicode BCP47 Locale Identifiers.)

  • forLanguageTag doesn't support underscores,

  • forLanguageTag accepts the zero-length string, which does not seem to be a valid language tag, and produces 'und' for it,

  • forLanguageTag does not permit "en-a", "en-z", "en-x". The LDML spec rejects "en-t" and "en-u" (cannot be empty), but the spec allows empty extensions for the other 24 singletons,

  • Deprecated items handling...
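The first three points above ("root", underscores, the empty string) could be handled by a pre-processing step before a tag reaches a BCP47 parser. The Python sketch below illustrates one possible reading. normalize_identifier is a hypothetical helper, not an ICU API, and rejecting the empty string is an assumption about the desired behaviour rather than something the ticket settles. The question of empty singleton extensions is left out.

```python
def normalize_identifier(identifier: str) -> str:
    """Map a Unicode locale identifier onto BCP47 syntax (illustrative sketch only)."""
    if not identifier:
        # Assumption: treat the zero-length string as an error instead of "und".
        raise ValueError("the empty string is not a well-formed language tag")

    # Unicode locale identifiers allow "_" as a separator; BCP47 uses "-" only.
    subtags = identifier.replace("_", "-").split("-")

    # "root" is not a valid language subtag; "und" takes its place.
    if subtags[0].lower() == "root":
        subtags[0] = "und"

    return "-".join(subtags)


print(normalize_identifier("en_US"))  # en-US
print(normalize_identifier("root"))   # und
```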

For deprecated items:

  • The spec gives some script and variant deprecations in tables. These deprecations are in supplementalMetadata.xml as scriptAlias and variantAlias, perhaps worth mentioning in the spec? (And how about subdivisionAlias and zoneAlias?)

  • Except for POSIX, the variant deprecations aren't implemented.

  • The spec suggests languageAlias should be able to influence more than just language subtags (e.g. "mo" -> "ro-MD"), but the implementation replaces only language subtags ("mo" -> "ro").
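The languageAlias point can be illustrated with a small Python sketch in which a replacement is allowed to contribute a script or region as well as a language. The alias table is hypothetical and contains only the "mo" -> "ro-MD" example from this ticket; apply_language_alias and _split are made-up names, and the rule that a script or region already present in the source wins over the replacement is an assumption about the intended behaviour. Extlang subtags are not handled.

```python
# Hypothetical alias table; the single entry is the example from this ticket.
LANGUAGE_ALIASES = {"mo": "ro-MD"}


def _split(subtags):
    """Return (script, region, remainder) for the subtags after the language."""
    script = region = None
    i = 0
    if i < len(subtags) and len(subtags[i]) == 4 and subtags[i].isalpha():
        script = subtags[i]
        i += 1
    if i < len(subtags) and (
        (len(subtags[i]) == 2 and subtags[i].isalpha())
        or (len(subtags[i]) == 3 and subtags[i].isdigit())
    ):
        region = subtags[i]
        i += 1
    return script, region, subtags[i:]


def apply_language_alias(tag: str) -> str:
    """Replace a deprecated language subtag, letting the alias add script/region."""
    subtags = tag.split("-")
    replacement = LANGUAGE_ALIASES.get(subtags[0].lower())
    if replacement is None:
        return tag

    new_language, *extra = replacement.split("-")
    src_script, src_region, tail = _split(subtags[1:])
    rep_script, rep_region, _ = _split(extra)

    # Assumption: fields already present in the source take precedence.
    script = src_script or rep_script
    region = src_region or rep_region
    return "-".join([new_language] + [s for s in (script, region) if s] + tail)


print(apply_language_alias("mo"))       # ro-MD
print(apply_language_alias("mo-RO"))    # ro-RO
print(apply_language_alias("mo-Cyrl"))  # ro-Cyrl-MD
```

By contrast, the behaviour reported above corresponds to replacing only subtags[0], which would turn "mo" into "ro" and lose the region.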

Environment

xpath

None

locale

None

Status

Assignee

Mark Davis

Reporter

TracBot

tracReporter

hugovdm@1d5920f4b44b27a8

tracOwner

mark

tracResolution

fixed

tracStatus

closed

Reviewer

Yoshito Umaoka

phase

dsub

tracCreated

Feb 27, 2019, 5:24 PM

Components

Fix versions

Priority

major