collation: primary+case ignores primary ignorables

Description

When the current ICU code processes (strength=primary & caseLevel=on), it ignores primary-ignorable CEs ([ss, tt|0,]). It has a comment saying that "otherwise, the CEs stop being well-formed".

Is this true? I don't see how it violates well-formedness "condition 1". What is the issue?

Mark and I discussed this, and we agree that this is a bug, although it might not be detectable with a sane tailoring.

We do consider secondary weights for primary ignorables ([ss, tt|0,]), and case bits for primary and secondary ignorables ([0, tt|0,]), when strength>PRIMARY. So why not case bits for primary ignorables ([ss, tt|0,]) when strength==PRIMARY?

The code looks like this:

Activity

TracBot
July 1, 2018, 9:43 AM
Trac Comment 3 by —2012-06-01T23:26:23.927Z

"although it might not be detectable"

On second thought, it's easily detectable. Compare the strings

with DUCET+primary+case. The current implementation makes them compare equal because it ignores the combining marks.

When we stop ignoring primary ignorables, then we will generate 00 case bits for the combining marks (because they do have non-ignorable tertiary weights) and the shorter string sorts lower.

TracBot
July 1, 2018, 9:43 AM
Trac Comment 4 by —2012-06-05T15:48:16.097Z

I think I found the source of the current code's test. When writing the case level for a sort key, we append a single 0 bit for uncased/lowercase. In this scheme, there is no way to distinguish between two or three or four 0 bits (until we start a new case level byte). As a result, if we write case level bits for primary ignorables, then //sort keys// become ill-formed although //CEs// and string comparisons are still well-formed.

The current code writes as many case level values as it writes primary values, so that the levels have the same length and no special termination of the case level is necessary.

Part of the question is whether we want 'a-umlaut' to sort the same as 'a' in primary+case. They sort the same when the case bits of primary ignorables are ignored; otherwise the a-umlaut contributes one more uncased/lowercase value for the umlaut and thus sorts greater than 'a', as usual for multi-level sorting.
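The two outcomes can be illustrated by comparing level lists directly. This is a minimal sketch: the CE tuples, the weight values, and the helper `levels` are hypothetical, and each CE is reduced to a (primary, case) pair with 0 standing for uncased/lowercase.

```python
# Sketch only: made-up weights, CEs reduced to (primary, case) pairs.
LOWER = 0
A = (0x29, LOWER)         # hypothetical CE for 'a'
DIAERESIS = (0, LOWER)    # primary ignorable, uncased/lowercase

def levels(ces, ignore_primary_ignorables):
    primaries = [p for p, _ in ces if p != 0]
    cases = [c for p, c in ces
             if p != 0 or not ignore_primary_ignorables]
    return (primaries, cases)  # compared level by level

a = [A]
a_umlaut = [A, DIAERESIS]

# Ignoring primary ignorables on the case level: the strings compare equal.
print(levels(a, True) == levels(a_umlaut, True))    # True
# True multi-level comparison: the shorter case level sorts lower.
print(levels(a, False) < levels(a_umlaut, False))   # True
```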

If we continue to ignore primary ignorables for the case level of primary+case, then we should also ignore secondary ignorables for the case level of secondary+case, so that the case level has the same number of values as the secondary level. The current code does not do that. (Luckily, secondary ignorables normally do not occur.)

If we change the behavior to true multi-level primary+case or primary+secondary+case sorting, then we need to distinguish trailing uncased/lowercase values from the end of the case level.

A simple way to do this would be to always write 2 bits per case value: 00=none, 01=lower, 10=mixed, 11=upper. (And similar for upperFirst, again with 00=none.) We would then not need the leading 1-bit in case level sort key bytes because we would start a byte with at least 01 so that the byte value would be at least 0x40. (Only 4 lowercase values would fit into a byte, rather than 7.)
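A minimal sketch of this 2-bit idea follows. The function name `two_bit_case_level` is hypothetical, and the sketch assumes that uncased and lowercase values share the code 01 while 00 only ever appears as padding.

```python
# Sketch only: 2 bits per case value, 00 reserved for "none" (padding).
CODES = {"lower": 0b01, "mixed": 0b10, "upper": 0b11}

def two_bit_case_level(cases):
    out = bytearray()
    for i in range(0, len(cases), 4):           # 4 values per byte
        byte = 0
        for j, case in enumerate(cases[i:i + 4]):
            byte |= CODES[case] << (6 - 2 * j)  # fill from the high bits down
        out.append(byte)
    return bytes(out)

a = two_bit_case_level(["lower"])                  # one value
a_umlaut = two_bit_case_level(["lower", "lower"])  # letter plus combining mark
print(a.hex(), a_umlaut.hex(), a < a_umlaut)  # 40 50 True
```

Because every real value is at least 01, a trailing lowercase value is no longer confused with padding, at the cost of fitting only 4 values into a byte.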

Or maybe we could keep the current case level sort key encoding but add a terminator. If we always add a 1-bit, it might just work. (Or add a 1-bit unless we would have to start a new byte just for that?)

Implementation note: If we need to check for primary-ignorable when comparing case level values, then that's the only reason why up-to-tertiary string comparison would have to store primary weights (or a bit per primary weight for whether it's non-ignorable) beyond the primary-level comparison. Otherwise, up-to-tertiary string comparison can just compare primaries and store secondaries and tertiaries for later-level processing.

TracBot
July 1, 2018, 9:43 AM
Trac Comment 5 by —2012-06-29T19:52:42.883Z

Discussed with Mark. We do want primary+case to collate ä=a to match users' expectations of accent-insensitive sorting. Therefore, with primary+case we should continue to ignore case level weights of primary ignorables.

By analogy, we should change secondary+case to ignore case level weights of secondary ignorables.

Otherwise continue to only ignore case level weights of tertiary ignorables.
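The rule as stated in this comment can be sketched as a predicate on a CE. This is an illustration only: CEs are modeled as hypothetical (primary, secondary, tertiary) tuples with made-up weights, and `contributes_case_weight` is not an ICU function.

```python
# Sketch only: which CEs contribute a case level weight, by strength.
PRIMARY, SECONDARY, TERTIARY = 1, 2, 3

def contributes_case_weight(ce, strength):
    primary, secondary, tertiary = ce
    if strength == PRIMARY:
        return primary != 0                       # skip primary ignorables
    if strength == SECONDARY:
        return primary != 0 or secondary != 0     # skip secondary ignorables
    return (primary, secondary, tertiary) != (0, 0, 0)  # skip tertiary ignorables

LETTER = (0x29, 0x05, 0x05)       # hypothetical weights
ACCENT = (0, 0x2B, 0x05)          # primary ignorable
TERTIARY_CE = (0, 0, 0x05)        # secondary ignorable

print(contributes_case_weight(ACCENT, PRIMARY))         # False
print(contributes_case_weight(ACCENT, SECONDARY))       # True
print(contributes_case_weight(TERTIARY_CE, SECONDARY))  # False
```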

Sort key case level compression: We could use a self-terminating compression similar to secondary & tertiary compression, but nibble-wise not byte-wise to be compact for normal strings (so one uppercase only costs half a byte):

  • lowerFirst
      • compress 1..7 common/lower weights into the range 1..7..D
      • mixed: E, upper: F

  • upperFirst
      • upper: 1, mixed: 2
      • compress 1..13 common/lower weights into the range 3..F

  • "Markus Scherer" would get a two-byte case level of F6 F6,
    whereas the current implementation would pack 16 bits into 3 bytes.

TracBot
July 1, 2018, 9:43 AM
Trac Comment 8 by —2014-03-01T22:54:40.653Z

The locale explorer/collation demo shows that the old code assigned real case bits to tertiary CEs (secondary ignorables), that they were not ignored with secondary+case, and that it could not distinguish between the absence of a case weight and a lowercase weight.

rules "&\u0001<<<b<<<B" with strength=secondary and caseLevel=on yields:

Note that "no case" and "lowercase" each add a single 0 bit to the case level, and there is no bit field terminator.

The new collation code follows the LDML 24 spec: It sets the case bits of tertiary CEs to uppercase, and it ignores tertiary CEs for secondary+case. I added this test case:

Fixed

Assignee: Markus Scherer

Reporter: Markus Scherer

Reviewer: None

Priority: medium

Time Needed: Hours