Exemplars: make spot fixes (was Add test to improve Locale data quality)
Description
Deleted Component: other
I noticed many failures in the localization of language names at http://unicode.org/cldr/utility/languageid.jsp. I ran a quick test with ICU (which uses CLDR data) and posted the results at http://spreadsheets.google.com/pub?key=rLxL9P8USP0HtFLakzp_A_Q&output=html
These are the results of trying to display just the language, script, and region codes used in ULocale.getAvailableLocales(); that is, the ones that would be used to display the localized names of the ICU available locales themselves. So no really strange scripts, etc. I suppress the country locales in the listing (so you just see "en" in column B, not "en-GB", "en-CA", ...).
I list the results by language, script, region, and mixed, where "mixed" covers the cases where the localized versions are not in the exemplar characters (main + aux). The exemplar failures are also listed on the second sheet.
The "ok" column is the sum of the language, script, and region successes, and is the sort key.
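For readers who want to reproduce the exemplar check described above, here is a minimal sketch. It is not the code behind the spreadsheet. The function name `outside_exemplars`, the abbreviated Polish exemplar sets, and the sample display names are all assumptions made for illustration; real data would come from CLDR.

```python
import unicodedata

def outside_exemplars(name, main, aux=frozenset()):
    """Return the letters/marks in `name` not covered by main + aux exemplars."""
    allowed = set(main) | set(aux)
    missing = []
    # Assumption: compare in NFC and lowercase, since exemplar sets are
    # normally listed in lowercase.
    for ch in unicodedata.normalize("NFC", name).lower():
        # Assumption: only letters (L*) and marks (M*) are checked; spaces,
        # digits, and punctuation are ignored.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        if ch not in allowed and ch not in missing:
            missing.append(ch)
    return missing

# Hypothetical, abbreviated exemplar data for Polish (illustration only).
PL_MAIN = set("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")
PL_AUX = set("qvx")

# Illustrative localized language names keyed by language code.
names = {"nb": "norweski bokmål", "de": "niemiecki"}

failures = {}
for code, display_name in names.items():
    missing = outside_exemplars(display_name, PL_MAIN, PL_AUX)
    if missing:
        failures[code] = missing

print(failures)  # {'nb': ['å']}
```

A name with an empty result counts as a success; anything else would be reported as an exemplar failure.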
Some observations and recommendations.
The cross-script inheritance really fails badly for az-Cyrl, and for pa_Arab (also uz_Arab, in the few strings it has). We really need to do something about cross-script inheritance.
The exemplar tests need improving:
o The test doesn't work well for zh, even when I special-case it to exclude the auxiliary exemplars, so users are not getting useful warnings. In CLDR we might consider filtering the zh exemplars to remove traditional-only characters, and filtering the zh-Hant ones to remove simplified-only characters.
o In CLDR, we need to add 'ー' and other multi-script characters to the Japanese exemplars.
o We should probably also add some of the other odd characters that are listed to the aux sets: x for Icelandic, a couple for Indonesian, and so on: http://spreadsheets.google.com/pub?key=rLxL9P8USP0HtFLakzp_A_Q&gid=0
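The zh / zh-Hant filtering suggested above could look roughly like the sketch below. It assumes a traditional-to-simplified mapping is available (for example, one derived from Unihan variant data); the three-entry `TRAD_TO_SIMP` table and the function names are hypothetical. It also ignores characters that are valid in both writing systems with different meanings, which a real filter would have to handle.

```python
# Hypothetical excerpt of a traditional -> simplified mapping; a real table
# would be far larger and would need review for characters used in both.
TRAD_TO_SIMP = {"語": "语", "漢": "汉", "國": "国"}

# Characters that have a distinct simplified form are treated as traditional-only.
TRADITIONAL_ONLY = {t for t, s in TRAD_TO_SIMP.items() if t != s}
# Their simplified counterparts are treated as simplified-only.
SIMPLIFIED_ONLY = {s for t, s in TRAD_TO_SIMP.items() if t != s} - TRADITIONAL_ONLY

def filter_zh(exemplars):
    """Drop traditional-only characters from a zh (simplified) exemplar set."""
    return set(exemplars) - TRADITIONAL_ONLY

def filter_zh_hant(exemplars):
    """Drop simplified-only characters from a zh-Hant exemplar set."""
    return set(exemplars) - SIMPLIFIED_ONLY

mixed = set("中文语語汉漢")
print(sorted(filter_zh(mixed)))       # keeps 中 文 语 汉
print(sorted(filter_zh_hant(mixed)))  # keeps 中 文 語 漢
```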
Activity
UnicodeBot
May 10, 2019 at 7:08 AM
Trac Comment 16 by —2012-01-15T19:37:20.000Z
Changes were:
add [彝] to the main exemplars for zh_Hant
add [å] to the aux exemplars for pl Polish, sr_Latn Serbian, ga Irish
In strings: Bokmål
add [QZqxzå] to the aux exemplars for id Indonesian
In a batch of language names
remove the string [ភាសាรัរូស្ស៉ី] from km Khmer (it contains 2 Thai characters)
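A string like the Khmer one above could be caught automatically with a rough script check. The sketch below is an illustration, not the check actually used for this fix. It assumes the first word of a character's Unicode name (for example KHMER or THAI) is a good enough stand-in for its script, because Python's standard library does not expose the Unicode Script property. The function names are hypothetical.

```python
import unicodedata

def script_guess(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # Character has no name in the Unicode database.
        return None

def foreign_chars(text, expected):
    """Return letters/marks in `text` whose guessed script differs from `expected`."""
    return [
        ch for ch in text
        # Only letters (L*) and marks (M*) are checked; the rest is skipped.
        if unicodedata.category(ch)[0] in ("L", "M")
        and script_guess(ch) != expected
    ]

# The string removed from km in the comment above.
suspect = "ភាសารัរូស្ស៉ី"
print(foreign_chars(suspect, "KHMER"))  # flags the two Thai characters
```

A check based on the real Script property (as ICU provides) would be more reliable than name prefixes, especially for characters shared between scripts.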
UnicodeBot
May 10, 2019 at 7:08 AM
Trac Comment by —2009-06-24T17:12:23.000Z
moved from incoming to data
UnicodeBot
May 10, 2019 at 7:08 AM
Trac Comment by —2009-06-24T17:12:22.000Z
changed notes2
UnicodeBot
May 10, 2019 at 7:08 AM
Trac Comment by old_notes2—1970-01-01T18:12:15.000Z
Look at cross-script inheritance, exemplars, tests.
UnicodeBot
May 10, 2019 at 7:08 AM
Trac Comment by notes2—1970-01-01T00:37:35.000Z
Look at cross-script inheritance, exemplars, tests.