Have model language as a preference and do casing test on a per item basis


For SurveyTool for CLDR 1.8 submissions there is a new feature that
tries to encourage casing (uppercase/lowercase) consistency within
certain sets of locale data within each locale.

However, this approach, even if it can be slightly improved, has
a number of problems.

1) It is within each locale only. There is no comparison to other
locales that may have similar rules (and more generally, similar
translations). The only other locale data compared to is English,
and English has rather unique rules for casing.

2) For some datasets (in particular currencies and eras), casing
variations within a single locale's data may be entirely appropriate.

3) Using only English as the model language gives the impression that
most things are supposed to be given with an initial capital letter,
especially since the English data inappropriately uses an initial
uppercase letter even for words that are not normally capitalized.
This is very likely the root cause of many of the casing problems
that are present in the CLDR data.

4) Giving warnings for casing "problems" as done now often gives the
warning inappropriately for data that is cased properly. This may lead
to inappropriate changes.

Instead, I would suggest not using the current approach to casing tests,
not even a slightly improved version.

In its place I would suggest the following:

A) Allow SurveyTool users to set model language as a user preference.
English is the default model language, but it can also be chosen
explicitly. Explicit and default English are treated differently; see C)
below.

B) Having chosen a model language, the "currently winning" data items
for that language are used instead of English data for the data to
be translated.

C) Choosing a model language explicitly enables the per-item casing test,
provided that the scripts of both the model locale and the target locale
are cased. The test can be turned off completely as a user preference.
The casing test compares the case of the first letter of the model
language data item (the currently winning one) with the case of the first
letter of the target language data item (the currently winning one).
If they differ, and the test has not been disabled, a mild warning on
casing mismatch is given. It should be made clear that this is only a
rough test and need not indicate an actual error.

D) In addition to using default English, users should be encouraged to
select a model language related to the target language, not only to
make casing comparisons but also to make translation comparisons, in
order to help consolidate translations.
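
The per-item test described in A) through C) could be sketched roughly as
follows. This is only an illustration, not SurveyTool code: the names
UserPrefs, first_letter_case and casing_warning are hypothetical, and the
"both scripts are cased" condition is approximated by checking whether the
first letter of each item is itself a cased letter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserPrefs:
    """Hypothetical per-user preferences (not an existing SurveyTool type)."""
    # None means no explicit choice was made; English is then the default
    # model language and the casing test stays off.
    model_locale: Optional[str] = None
    # User preference to switch the casing test off completely.
    casing_test_enabled: bool = True


def first_letter_case(text: str) -> Optional[str]:
    """Return 'upper' or 'lower' for the first letter, or None if uncased."""
    for ch in text:
        if ch.isalpha():
            if ch.isupper():
                return "upper"
            if ch.islower():
                return "lower"
            return None  # first letter is from an uncased script
    return None  # no letters at all


def casing_warning(prefs: UserPrefs, model_value: str,
                   target_value: str) -> Optional[str]:
    """Compare the first-letter case of the two currently winning values."""
    # Only an explicitly chosen model language enables the test.
    if prefs.model_locale is None or not prefs.casing_test_enabled:
        return None
    model_case = first_letter_case(model_value)
    target_case = first_letter_case(target_value)
    # Skip the comparison when either item does not start with a cased letter.
    if model_case is None or target_case is None:
        return None
    if model_case != target_case:
        return ("Casing differs from the model language item; "
                "this is a rough check and may not be an error.")
    return None


# Example: a user vetting Bokmål data with Nynorsk chosen as model language.
prefs = UserPrefs(model_locale="nn")
warning = casing_warning(prefs, "januar", "Januar")
```

Returning a message rather than raising an error reflects the intent that
the result is only a mild warning that the vetter is free to ignore.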

Note that this bug report supersedes bug report 1693.






Steven R. Loomis
October 24, 2019, 10:22 PM

said in cldr-1693 :

It would be good to be able to select as a personal, and maybe temporary,
preference another language (or rather locale) as the source for the second and
third column data. Currently it is always "en_US" data that is shown. This
would make it a bit easier to get consistent data submissions for locales for
related languages (e.g. Swedish/Norwegian, Dutch/Afrikaans, and others). As a
translator one can then, e.g. easily look at Nynorsk data while filling in or
checking/vetting Norwegian Bokmål data, and many other similar cases where
consistency is a high priority.


May 10, 2019, 6:59 AM
Trac Comment 9 by —2014-02-03T16:49:55.738Z

Merging future and UNSCH

May 10, 2019, 6:59 AM
Trac Comment 8 by —2012-01-19T18:13:41.000Z

This suggests some additional ways to test for casing consistency (in addition to those already added) that we may still want to consider.

May 10, 2019, 6:59 AM
Trac Comment 3 by —2010-08-11T21:50:33.000Z

The problem is that the model language gets special treatment. We make sure that it gets proper updates from ISO, that it reflects new syntax and structure, etc. If a user chooses their own model language, they may enter an incorrect translation if the model language itself is incorrect.

So, I think an alternate language should ONLY be used as an addition, and potentially should be restricted to a certain set.

The Survey Tool does NOT dynamically reflect changes to the model language. Having a model language which can be in flux would produce incorrect results depending on when a user viewed the comparison.




Peter Edberg







