Exemplars: make spot fixes (was Add test to improve Locale data quality)

Description

Deleted Component: other

I noticed lots of failures in http://unicode.org/cldr/utility/languageid.jsp in
the localization of language names. I ran a quick test with ICU (which uses CLDR
data) and posted the results on

http://spreadsheets.google.com/pub?key=rLxL9P8USP0HtFLakzp_A_Q&output=html

These are the results of trying to display just the language, script, and
region codes used in ULocale.getAvailableLocales(), that is, the ones that
would be used in displaying the localized names of the ICU available locales
themselves. So there are no really strange scripts, etc. I suppress the country
locales in the listing (so you just see "en" in column B, not "en-GB",
"en-CA", ...).

I list the results by language, script, region, and mixed, where "mixed" covers
the cases where the localized names contain characters that are not in the
locale's exemplar characters (main + aux). I also list the exemplar failures on
the second sheet.
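
As a rough illustration of the exemplar check described above (not the actual ICU-based test that produced the spreadsheet), a minimal Python sketch might look like this. The function name and the sample exemplar sets are assumptions made for illustration; real exemplar sets would come from the CLDR data.

```python
import unicodedata

# Hypothetical sample data: the Polish alphabet as the "main" set and a few
# extra letters as "aux". Real exemplar sets would be read from CLDR.
PL_MAIN = set("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")
PL_AUX = set("qvx")


def outside_exemplars(name, main, aux):
    """Return the letters of `name` that are in neither main nor aux.

    Simplification: only characters for which str.isalpha() is true are
    checked, so combining marks, digits, and punctuation are skipped.
    """
    allowed = main | aux
    text = unicodedata.normalize("NFC", name).lower()
    return [ch for ch in text if ch.isalpha() and ch not in allowed]


# "Bokm\u00e5l" is "Bokmål"; å (U+00E5) is not in the sample sets above.
print(outside_exemplars("Bokm\u00e5l", PL_MAIN, PL_AUX))
# Adding å to the aux set makes the same string pass.
print(outside_exemplars("Bokm\u00e5l", PL_MAIN, PL_AUX | {"\u00e5"}))
```

A name with an empty result would count as a success; anything else would land in the "mixed" bucket.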

The "ok" column is the sum of the language, script, and region successes, and
it determines the sort order.

Some observations and recommendations.

  • The cross-script inheritance really fails badly for az-Cyrl, and for
    pa_Arab (also uz_Arab, in the few strings it has). We really need to do
    something about cross-script inheritance.

  • The exemplar tests need improving:
    o The test doesn't work well for zh, even when I special-case it to
      exclude the auxiliary exemplars, so users are not getting useful
      warnings. In CLDR we might consider filtering the zh exemplars to
      remove traditional-only characters, and filtering the zh-Hant ones to
      remove simplified-only characters.
    o In CLDR, we need to add 'ー' and other multi-script characters to the
      Japanese exemplars.
    o We should also probably add some of the other odd characters that are
      listed to the aux sets (x for Icelandic, a couple for Indonesian, and
      so on):
      http://spreadsheets.google.com/pub?key=rLxL9P8USP0HtFLakzp_A_Q&gid=0
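
The zh filtering suggested above amounts to a set difference. A minimal sketch, assuming we already had lists of traditional-only and simplified-only characters (the tiny sets below are hand-picked samples for illustration, not CLDR data, and the function name is hypothetical):

```python
# Hypothetical sample data, hand-picked for illustration only.
# 国/语 are simplified forms; 國/語 are the traditional forms; 中/文 are shared.
ZH_EXEMPLARS = set("中文国语國語")
TRADITIONAL_ONLY = set("國語")
SIMPLIFIED_ONLY = set("国语")


def filter_exemplars(exemplars, other_variant_only):
    """Drop characters that belong only to the other writing variant."""
    return exemplars - other_variant_only


zh = filter_exemplars(ZH_EXEMPLARS, TRADITIONAL_ONLY)
zh_hant = filter_exemplars(ZH_EXEMPLARS, SIMPLIFIED_ONLY)
print(sorted(zh))
print(sorted(zh_hant))
```

The hard part is not the subtraction but obtaining reliable traditional-only and simplified-only lists; where those would come from is left open here.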

Activity

UnicodeBot 
May 10, 2019 at 7:08 AM

Trac Comment 16 by —2012-01-15T19:37:20.000Z

Changes were:
add [彝] to the main exemplars for zh_Hant
add [å] to the aux exemplars for pl (Polish), sr_Latn (Serbian), ga (Irish); in strings: Bokmål
add [QZqxzå] to the aux exemplars for id (Indonesian); in a batch of language names
remove the string [ភាសាรัរូស្ស៉ី] from km Khmer (it contains 2 Thai characters)
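
The Khmer removal above is the kind of mixed-script string that a simple scan can flag. The sketch below is only an approximation: it uses the first word of each character's Unicode name (via Python's `unicodedata.name`) as a stand-in for the script, which is an assumption of this example and not the real Unicode Script property. The function name is hypothetical.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count characters by the first word of their Unicode character name.

    This is a rough proxy for the script (e.g. "KHMER", "THAI", "LATIN"),
    not the actual Unicode Script property.
    """
    counts = Counter()
    for ch in text:
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split(" ", 1)[0]] += 1
    return counts


# The string removed from km in the change list above.
khmer_string = "ភាសាรัរូស្ស៉ី"
counts = script_counts(khmer_string)
print(counts["KHMER"], counts["THAI"])
```

Any string whose counts include more than one script name would be a candidate for manual review.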

UnicodeBot 
May 10, 2019 at 7:08 AM

Trac Comment by —2009-06-24T17:12:23.000Z

moved from incoming to data

UnicodeBot 
May 10, 2019 at 7:08 AM

Trac Comment by —2009-06-24T17:12:22.000Z

changed notes2

UnicodeBot 
May 10, 2019 at 7:08 AM

Trac Comment by old_notes2—1970-01-01T18:12:15.000Z

Look at cross-script inheritance, exemplars, tests.

UnicodeBot 
May 10, 2019 at 7:08 AM

Trac Comment by notes2—1970-01-01T00:37:35.000Z

Look at cross-script inheritance, exemplars, tests.

Unresolved

Created January 11, 2019 at 4:41 AM
Updated November 10, 2021 at 11:00 PM