Ignorable code points get an incorrect weight at strength 4 with partial sort keys

Description

When ucol_nextSortKeyPart() is used to generate sort key weights with a strength 4 (quaternary) collator, ignorable characters are given 0xFF as their strength 4 weight. This does not happen with ucol_getSortKey(). As a result, strings that collate as equal by their ucol_getSortKey() keys collate as unequal by their ucol_nextSortKeyPart() keys.

Here is an example program that demonstrates this.

Output on ICU 3.2.1:

Output on ICU 3.8:

Activity

TracBot
June 30, 2018, 11:26 PM
Trac Comment description.2 by —2007-11-09T21:18:43.000Z

I looked at ICU's code and I think I have figured out the issue: on the quaternary level, if the CE currently being processed is completely ignorable (one that equals zero), nothing should be added to the quaternary level. Right now, a 0xFF or a Hiragana quaternary value always gets added.

The likely fix is to change line 6074 of ucol.cpp so that ignorable characters don't get added.
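
The idea behind that fix can be illustrated with a small standalone model. This is only a sketch, not the code in ucol.cpp: the CollationElement struct, the quaternaryWeights() function, the skipIgnorable flag, and the Hiragana weight value are all hypothetical names and values chosen for illustration. The only details taken from this ticket are the 0xFF weight and the rule that a CE equal to zero should add nothing.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified collation element (CE). This is NOT ICU's internal
// representation; a value of 0 stands for a completely ignorable CE.
struct CollationElement {
    std::uint32_t value;
    bool isHiragana;
};

// 0xFF is the weight named in the ticket. The Hiragana weight is a placeholder.
constexpr std::uint8_t kCommonQuaternary = 0xFF;
constexpr std::uint8_t kHiraganaQuaternary = 0x28;  // assumed value, illustration only

// Builds the quaternary-level bytes for a sequence of CEs.
// skipIgnorable == false models the reported behavior (every CE adds a byte);
// skipIgnorable == true models the proposed behavior (zero CEs add nothing).
std::vector<std::uint8_t> quaternaryWeights(const std::vector<CollationElement>& ces,
                                            bool skipIgnorable) {
    std::vector<std::uint8_t> out;
    for (const CollationElement& ce : ces) {
        if (skipIgnorable && ce.value == 0) {
            continue;  // completely ignorable: contributes nothing at this level
        }
        out.push_back(ce.isHiragana ? kHiraganaQuaternary : kCommonQuaternary);
    }
    return out;
}
```

In this model, a sequence with one non-ignorable CE and the same sequence followed by a zero CE produce different quaternary bytes when skipIgnorable is false, and identical bytes when it is true, which mirrors the equal versus unequal discrepancy described above.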

Fixed

Assignee

TracBot

Reporter

TracBot

Labels

None

Reviewer

None

Priority

major

Time Needed

Weeks