collation: reorder single scripts

Description

I propose that we enable single-script reordering, rather than reordering scripts in the current groups. This would solve a few problems, at minimal cost.

None of this changes anything about the space, punct, symbol, currency, digit, Latn, Hani reordering groups.

We currently take the Unicode scripts (alphabets etc.) in DUCET order, declare each Recommended Script as a sort of "anchor script", and create groups of scripts such that each group starts with such an anchor script. We give each group one primary-weight lead byte. We document script reordering as building a permutation of primary lead bytes.

We group scripts together because there are too many Unicode scripts to give each one a whole lead byte, and a lead byte permutation is very simple.
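To make the current mechanism concrete, here is a minimal sketch of a lead byte permutation. It assumes 32-bit primary weights whose top byte is the lead byte; the function names and the lead byte values in the usage note are hypothetical and are not taken from FractionalUCA.txt.

```python
def build_lead_byte_permutation(default_lead_bytes, new_order):
    """Return a 256-entry table mapping default lead bytes to reordered ones.

    default_lead_bytes: dict of group name -> lead byte (hypothetical values).
    new_order: every group name exactly once, in the desired order.
    """
    if sorted(new_order) != sorted(default_lead_bytes):
        raise ValueError("new_order must list every group exactly once")
    table = list(range(256))  # identity: unlisted lead bytes do not move
    # The groups are redistributed over the same lead bytes they occupy now.
    slots = sorted(default_lead_bytes.values())
    for slot, name in zip(slots, new_order):
        table[default_lead_bytes[name]] = slot
    return table


def reorder_primary(primary, table):
    """Apply the permutation to the lead byte of a 32-bit primary weight."""
    lead = (primary >> 24) & 0xFF
    return (table[lead] << 24) | (primary & 0x00FFFFFF)
```

For example, with hypothetical groups `{"Grek": 0x31, "Cyrl": 0x32, "Armn": 0x33}` and the new order `["Armn", "Grek", "Cyrl"]`, a primary with lead byte 0x33 is mapped to lead byte 0x31. The table lookup is the whole cost per primary, which is why the permutation is attractive despite the grouping problems listed below.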

Issues:

  • The DUCET script order, together with the set of Recommended Scripts, causes very imbalanced groups of scripts (see the "top_byte" table in FractionalUCA.txt). The largest ones tend to "overflow", requiring either splits at non-Recommended Scripts (we added Cherokee as an anchor script in CLDR 24) or smaller gaps between primaries than we would like.

  • More scripts will be added to Unicode, so we will have to revisit this again.

  • Several Recommended Scripts get a whole lead byte but have only a small number of primary weights.

  • Some of the groups contain unrelated scripts.

  • We like to group related scripts together, so that they move together; on the other hand, one might prefer a different order specifically of related scripts (e.g., among the Philippine scripts) which is not currently possible.

  • It is difficult to come up with a script order that is much "better", because relationships between scripts are complicated, and the Recommended Scripts are not the best anchors from a relatedness perspective.

If we reorder single scripts, then we do not need to justify the groups, we can freely allocate appropriate portions of the primary weight space, we do not need to keep "related" scripts next to each other (or figure out what "related" means), and we do not need to care about the default order of scripts. Usability and documentation would be simpler.

In FractionalUCA.txt, I propose that we use whole bytes for a few very common scripts, and allocate one or more sixteenths of a lead byte for each of the other scripts. Script reordering would index by the top 12 primary bits. (This can be a small table by using a single offset value for whole lead bytes, and 16 values only for split bytes that do not all move by the same offset.)
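The compact table described in the parenthesis could look roughly like the following sketch. It assumes 32-bit primaries, measures offsets in sixteenths of a lead byte (units of the 12-bit index), and uses a hypothetical layout: lead byte 0x60 holds one whole-byte script W, and lead byte 0x61 is shared by scripts P (sixteenths 0..7) and Q (sixteenths 8..15). None of these values come from FractionalUCA.txt.

```python
def reorder_by_top12(primary, offsets):
    """Reorder a 32-bit primary by indexing on its top 12 bits.

    offsets maps a lead byte to either
      - one int: every sixteenth of that lead byte moves by the same offset, or
      - a list of 16 ints: one offset per sixteenth of a split lead byte.
    Lead bytes without an entry do not move.
    """
    index = primary >> 20                  # top 12 bits: lead byte + sixteenth
    entry = offsets.get(index >> 4, 0)     # look up by lead byte
    if isinstance(entry, int):
        offset = entry
    else:
        offset = entry[index & 0xF]        # split byte: per-sixteenth offset
    return ((index + offset) << 20) | (primary & 0x000FFFFF)


# Hypothetical reordering W, P, Q -> Q, P, W:
#   Q: 0x618..0x61F -> 0x600..0x607  (offset -24)
#   P: 0x610..0x617 -> 0x608..0x60F  (offset  -8)
#   W: 0x600..0x60F -> 0x610..0x61F  (offset +16)
EXAMPLE_OFFSETS = {
    0x60: 16,
    0x61: [-8] * 8 + [-24] * 8,
}
```

Only the split lead byte needs 16 values; the whole-byte script costs a single entry. This sketch shows the general 12-bit indexing only and does not restrict offsets to whole bytes.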

For an implementation (like ICU) that writes sort keys as byte sequences, the reordering offset needs to be by whole bytes to avoid problems (with single-byte primaries, primary compression, and sort key byte validity). Reordering partial-byte scripts can be done by splitting the scripts that share such lead bytes, for which a small number of lead bytes would be reserved. Reorderings could not be completely arbitrary in that case, but it would be much more flexible than reordering whole groups.
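One possible way to picture whole-byte offsets for scripts that share a lead byte is a sorted list of primary-weight ranges, each with an offset that is a multiple of a whole lead byte. The sketch below is an illustration under assumptions, not ICU's actual data structure: the function name, the range tuples, and the lead byte values (0x61 shared by scripts P and Q, 0x62 reserved) are all hypothetical.

```python
import bisect

LEAD_BYTE = 0x01000000  # one whole lead byte, for 32-bit primaries


def make_range_reorderer(ranges):
    """Build a reordering function from (start, limit, offset) tuples.

    Each range covers primaries with start <= primary < limit and moves them
    by offset, which must be a multiple of a whole lead byte so that the lower
    bytes of the sort key are unchanged. Ranges must not overlap.
    """
    ranges = sorted(ranges)
    for _, _, offset in ranges:
        if offset % LEAD_BYTE != 0:
            raise ValueError("offsets must be whole lead bytes")
    starts = [start for start, _, _ in ranges]

    def reorder(primary):
        i = bisect.bisect_right(starts, primary) - 1  # last range starting <= primary
        if i >= 0:
            _, limit, offset = ranges[i]
            if primary < limit:
                return primary + offset
        return primary  # outside every range: unchanged

    return reorder


# Hypothetical: P occupies the lower half of lead byte 0x61, Q the upper half.
# To sort Q before P, move P into the reserved lead byte 0x62; Q stays put.
reorder_q_before_p = make_range_reorderer(
    [(0x61000000, 0x61800000, 1 * LEAD_BYTE)]
)
```

In this example P keeps its position within the lead byte and only the lead byte changes, so the two scripts are split apart rather than packed tightly. That is the sense in which such reorderings are not completely arbitrary.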

Some scripts that currently use less than a sixteenth of a lead byte would use more space, but that is balanced by reducing some small scripts from whole bytes to a few sixteenths. (We would continue to use two-byte primary weights for almost all of the Recommended Scripts that use them now.)

xpath

None

locale

None

Activity

TracBot
May 10, 2019, 1:31 AM
Trac Comment 4 by —2015-01-03T19:13:50.328Z

For the UnicodeTools changes, see http://unicode.org/edcom/bugtrack/changeset/748/unicodetools

TracBot
May 10, 2019, 1:31 AM
Trac Comment 5 by —2015-01-03T21:12:57.835Z

Scripts can start on any two-byte boundary. High-frequency scripts use whole lead bytes, for fast lead byte permutation. ICU will support split lead bytes via a list of primary-weight ranges.

TracBot
May 10, 2019, 1:31 AM
Trac Comment 6 by —2015-01-07T00:46:17.001Z

Note: Before Unicode 5.2, FractionalUCA.txt always used a whole primary lead byte per script. If script reordering had been specified at that time, it would have naturally reordered single scripts rather than groups.

TracBot
May 10, 2019, 1:31 AM
Trac Comment 7 by —2015-01-07T03:45:56.696Z

Further notes on script allocation and implementation notes see .

Priority

medium

Assignee

Markus Scherer

Reporter

Markus Scherer

Reviewer

Mark Davis

Labels

None

Components

Fix versions

phase

rc