use more byte values in collation tailorings

Description

As far as I can tell, we never use byte values 03 and FF in primary weights. However, it looks like this restriction is only necessary for the second bytes of primary weights (where those values would collide with the primary-compression termination bytes), not for the first, third and fourth bytes. This means that if we are a little smarter about the byte value ranges when tailoring, we might be able to fit a few more characters into small gaps with 2-byte or 3-byte weights.
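
To get a feel for how much room this would free up, here is a rough Python sketch that counts the weights available under a single lead byte. It only models the description above and is not ICU code. The function names are hypothetical, and it assumes that byte values 00..02 stay reserved in every position.

```python
# Hypothetical model of the proposal in the description; not ICU code.
# Assumption: byte values 00..02 stay reserved in every byte position.

# Current behavior as described: 03 and FF are never used, so 04..FE.
OLD_RANGE = range(0x04, 0xFF)


def proposed_range(position):
    """Allowed byte values for a 1-based byte position in a primary weight."""
    if position == 2:
        # Second bytes could collide with the primary-compression
        # termination bytes, so 03 and FF stay excluded here.
        return range(0x04, 0xFF)
    # First, third and fourth bytes could also use 03 and FF.
    return range(0x03, 0x100)


def weights_under_lead_byte(length, range_for_position):
    """Count weights of the given total length that share one lead byte."""
    count = 1
    for position in range(2, length + 1):
        count *= len(range_for_position(position))
    return count


old_3 = weights_under_lead_byte(3, lambda position: OLD_RANGE)
new_3 = weights_under_lead_byte(3, proposed_range)
print(old_3, new_3)
```

In this model the number of 2-byte weights per lead byte does not change, because only the second byte follows the lead byte. The gain comes from third and fourth bytes and from additional lead byte values.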

Activity

TracBot
July 1, 2018, 12:04 AM
Trac Comment 3 by —2010-07-08T12:32:31.588Z

Note: Byte 02 is used as a separator in merged sort keys. When comparing strings, or using sort keys without merging them, 02 is harmless. Still, this is pretty bad.

Consider working on tickets #7757 (use more byte values in collation tailorings) and #7788 (CE: Tertiary Byte out of range) together, creating a more maintainable C++ class for the weight iterator that knows about the byte value ranges at all levels.

For weight byte values see http://site.icu-project.org/design/collation/bytes

TracBot
July 1, 2018, 12:04 AM
Trac Comment 6 by —2012-02-07T20:57:08.609Z

Possible further refinement: Bytes 03 & FF are ok even in primary-weight second bytes if the weight's lead byte is not compressible. That is, we will never write the primary-compression terminators for incompressible lead bytes, and need not reserve those byte values in those cases. This could be especially useful for large Han tailorings after &[regular|last].

TracBot
July 1, 2018, 12:04 AM
Trac Comment 7 by —2012-06-29T21:12:51.501Z

Better: Byte 02 is also usable, except in

  • any-weight lead bytes

  • primary-weight second bytes for compressible lead bytes

because the merge separator is only compared with those.

The primary-compression low-terminator is omitted at the end of the primary level. When sort keys are merged, the end of the primary level turns into an 02 merge separator, which might be compared with a primary-weight second byte. So we can't use 02 as a primary-weight second byte for compressible lead bytes.

We _can_ use 02 as a primary-weight second byte for _in_compressible lead bytes.

The 02 merge separator might also be compared with a primary-compression low-terminator of a longer sort key, so the low-terminator must still be at least 03.
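
The rules collected so far for primary weights can be written down as a small predicate. This is a minimal Python sketch of the rules as stated in this ticket, not the collv2 implementation. The function name and parameters are hypothetical, and it assumes that 00 and 01 remain reserved in all positions.

```python
# Hypothetical predicate summarizing the ticket's rules for primary weights.
# Assumption: 00 (terminator) and 01 (level separator) are never usable.

def primary_byte_allowed(value, position, lead_is_compressible):
    """Return True if `value` may appear at the 1-based `position`
    of a primary weight, per the rules discussed in this ticket."""
    if value <= 0x01:
        return False
    # Second bytes after a compressible lead byte are the only ones that
    # are compared with compression terminators or the merge separator.
    restricted_second = position == 2 and lead_is_compressible
    if value == 0x02:
        # 02 is the merge separator: never in lead bytes (this part of
        # the rule applies to weights of any level), and not in the
        # restricted second-byte position.
        return position != 1 and not restricted_second
    if value in (0x03, 0xFF):
        # Values reserved for the primary-compression terminators.
        return not restricted_second
    return True


# 02 as a second byte: fine after an incompressible lead byte only.
print(primary_byte_allowed(0x02, 2, lead_is_compressible=False))
print(primary_byte_allowed(0x02, 2, lead_is_compressible=True))
```

Which lead bytes are compressible is a property of the collation data and is left as an input here.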

It is tempting to use a _two-byte_ merge separator, so that we could use 02 in weight lead bytes. However, this would cause problems:

  • The reordering table must not modify the merge separator, so its lead byte cannot be used for reorderable weights.

  • The existing ucol_mergeSortKeys() uses single-byte 02, and I think we want to be compatible with that, so that an old ICU's mergeSortKeys() works with a newer ICU's sort keys.

If we use 02 in regular weights where possible, it becomes impossible to split a merged sort key into its pieces, but we don't promise that that is possible. Sort keys already cannot be analyzed meaningfully beyond separating out the levels.

We cannot use 01 in non-lead bytes because that would break the existing ucol_mergeSortKeys().
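
To illustrate why 01 must stay reserved, here is a simplified Python model of a level-by-level merge. It is written from the behavior described in this comment and is not the ICU implementation of ucol_mergeSortKeys(). The helper names and the sample byte values are made up for illustration.

```python
from itertools import zip_longest

LEVEL_SEPARATOR = b"\x01"
MERGE_SEPARATOR = b"\x02"
TERMINATOR = b"\x00"


def split_levels(sort_key):
    """Split a zero-terminated sort key into its levels at each 01 byte."""
    if sort_key.endswith(TERMINATOR):
        sort_key = sort_key[:-1]
    return sort_key.split(LEVEL_SEPARATOR)


def merge_sort_keys(key1, key2):
    """Simplified model: join matching levels with 02, levels with 01."""
    merged_levels = [
        level1 + MERGE_SEPARATOR + level2
        for level1, level2 in zip_longest(
            split_levels(key1), split_levels(key2), fillvalue=b""
        )
    ]
    return LEVEL_SEPARATOR.join(merged_levels) + TERMINATOR


# Made-up three-level keys: primary 01 secondary 01 tertiary 00.
key_a = bytes([0x2A, 0x01, 0x05, 0x01, 0x05, 0x00])
key_b = bytes([0x2C, 0x01, 0x05, 0x01, 0x05, 0x00])
print(merge_sort_keys(key_a, key_b).hex(" "))

# A weight with 01 as a non-lead byte (2A 01) looks like an extra level.
bad_key = bytes([0x2A, 0x01, 0x01, 0x05, 0x01, 0x05, 0x00])
print(len(split_levels(key_a)), len(split_levels(bad_key)))
```

Because the merge finds level boundaries by scanning for 01, a weight containing 01 would be cut in the middle. A weight containing 02 merges fine in this model. It only prevents splitting the merged key back into its pieces, which is not promised anyway.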

TracBot
July 1, 2018, 12:04 AM
Trac Comment 8 by —2013-10-25T21:08:55.076Z

Done on the "collv2" branch.

TODO: document byte values, probably in LDML spec, maybe also ICU-specific stuff in User Guide

TracBot
July 1, 2018, 12:04 AM
Trac Comment 9 by —2014-02-27T23:48:55.271Z

I made a note elsewhere about updating the documentation.

Fixed

Assignee

Markus Scherer

Reporter

Markus Scherer

Components

Labels

Reviewer

None

Priority

medium

Time Needed

Days

Fix versions