Exemplars: make spot fixes (was Add test to improve Locale data quality)

Description

Deleted Component: other

I noticed lots of failures in http://unicode.org/cldr/utility/languageid.jsp in
the localization of language names. I ran a quick test with ICU (which uses CLDR
data) and posted the results on

http://spreadsheets.google.com/pub?key=rLxL9P8USP0HtFLakzp_A_Q&output=html

These are the results obtained when trying to display just the language, script,
and region codes used in ULocale.getAvailableLocales(); that is, the ones that
would be used in displaying the localized names of the ICU available locales
themselves. So no really strange scripts, etc. I suppress the country locales in
the listing (so you just see "en" in column B, not "en-GB", "en-CA", ...).

I list the results by language, script, region, and mixed, where "mixed" covers
cases in which the localized names contain characters that are not in the
exemplar characters (main + aux). The exemplar failures are also listed on the
second sheet.
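
The exemplar check described above can be sketched roughly as follows. This is a minimal illustration in Python, not the test that was actually run against ICU: the function name `chars_outside_exemplars` and the `TOY_MAIN` set are hypothetical, the toy set is not real CLDR data, and real exemplar sets can also contain multi-character strings, which this sketch ignores.

```python
# Hypothetical toy "main" exemplar set (plain a-z); NOT actual CLDR data.
TOY_MAIN = set("abcdefghijklmnopqrstuvwxyz")


def chars_outside_exemplars(name, main, aux=frozenset()):
    """Return the characters of `name` not covered by main + aux exemplars.

    Assumptions of this sketch: whitespace is ignored, and uppercase letters
    are accepted when their lowercase form is in the exemplar set.
    """
    allowed = set(main) | set(aux)
    missing = []
    for ch in name:
        if ch.isspace():
            continue
        if ch in allowed or ch.lower() in allowed:
            continue
        if ch not in missing:  # report each offending character once
            missing.append(ch)
    return missing


# A localized name fails the check if anything is left over.
failures = chars_outside_exemplars("Bokmål", TOY_MAIN)
# Adding the character to the auxiliary set makes the same name pass.
passes = chars_outside_exemplars("Bokmål", TOY_MAIN, aux={"å"})
```

A name with a non-empty result would count as an exemplar failure for that locale; adding the reported characters to the auxiliary set is one way to clear it.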

The "ok" column is the sum of the language, script, and region successes, and
determines the sort order.

Some observations and recommendations:

- Cross-script inheritance fails badly for az-Cyrl and for pa_Arab (also
  uz_Arab, in the few strings it has). We really need to do something about
  cross-script inheritance.
- The exemplar tests need improving:
  o They don't work well for zh, even when I special-case it to exclude the
    auxiliary exemplars, so users are not getting useful warnings. In CLDR we
    might consider filtering the zh exemplars to remove traditional-only
    characters, and filtering the zh-Hant ones to remove simplified-only ones.
  o In CLDR, we need to add 'ー' and other multi-script characters to the
    Japanese exemplars.
  o We should also probably add some of the other odd characters that are
    listed to the aux sets: x for Icelandic, a couple for Indonesian, and so on:
    http://spreadsheets.google.com/pub?key=rLxL9P8USP0HtFLakzp_A_Q&gid=0
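
The zh / zh-Hant filtering suggested above amounts to a set difference. The Python sketch below only illustrates that idea: `filter_exemplars` and the toy sets are hypothetical, the sets are not the CLDR exemplars, and the list of traditional-only (or simplified-only) characters would have to come from some external source that this ticket does not specify.

```python
def filter_exemplars(exemplars, exclude):
    """Return the exemplar set with every character in `exclude` removed."""
    return set(exemplars) - set(exclude)


# Toy data for illustration only (NOT real CLDR exemplar sets).
# 国/语 are simplified forms; 國/語 are their traditional counterparts.
zh_toy = set("国國语語中文")
traditional_only_toy = set("國語")
simplified_only_toy = set("国语")

# zh keeps the simplified and shared characters ...
zh_filtered = filter_exemplars(zh_toy, traditional_only_toy)
# ... while zh-Hant keeps the traditional and shared ones.
zh_hant_filtered = filter_exemplars(zh_toy, simplified_only_toy)
```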

Activity

UnicodeBot
May 10, 2019 at 7:08 AM

Trac Comment 16 by —2012-01-15T19:37:20.000Z

Changes were:
add [彝] to the main exemplars for zh_Hant
add [å] to the aux exemplars for pl (Polish), sr_Latn (Serbian), ga (Irish); found in strings: Bokmål
add [QZqxzå] to the aux exemplars for id (Indonesian); found in a batch of language names
remove the string [ភាសាรัរូស្ស៉ី] from km Khmer (it contains 2 Thai characters)
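
A stray-script problem like the Khmer string above can be spotted with a rough check on Unicode character names. This is only a sketch under stated assumptions: the Python standard library has no script property, so the hypothetical helper `chars_with_name_prefix` uses the character-name prefix (for example "THAI") as an approximation of the script, and it is not how the fix in this ticket was found.

```python
import unicodedata


def chars_with_name_prefix(text, prefix):
    """Return the characters of `text` whose Unicode name starts with `prefix`.

    The name prefix (e.g. "THAI", "KHMER") is used as a stand-in for the
    script property, which the standard library does not expose.
    """
    return [
        ch for ch in text
        if unicodedata.name(ch, "").startswith(prefix + " ")
    ]


# The string removed from km, as quoted in the change list above.
khmer_name = "ភាសាรัរូស្ស៉ី"
thai_chars = chars_with_name_prefix(khmer_name, "THAI")
for ch in thai_chars:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```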

UnicodeBot
May 10, 2019 at 7:08 AM

Trac Comment by —2009-06-24T17:12:23.000Z

moved from incoming to data

UnicodeBot
May 10, 2019 at 7:08 AM

Trac Comment by —2009-06-24T17:12:22.000Z

changed notes2

UnicodeBot
May 10, 2019 at 7:08 AM

Trac Comment by old_notes2—1970-01-01T18:12:15.000Z

Look at cross-script inheritance, exemplars, tests.

UnicodeBot
May 10, 2019 at 7:08 AM

Trac Comment by notes2—1970-01-01T00:37:35.000Z

Look at cross-script inheritance, exemplars, tests.

Unresolved

Details

Labels: data
Priority: major
Fix versions:
Assignee: Mark Davis
Reviewer: UnicodeBot
Reporter: Mark Davis

Created January 11, 2019 at 4:41 AM
Updated November 10, 2021 at 11:00 PM
