Collation version not bumped in ICU-73

Description

With both ICU 72 and ICU 73, ucol_getVersion() for the collation associated with the 'und' locale returns "153.120".

Yet some strings appear to sort differently across these versions.

For instance, when comparing "A‘B" and "A‚2", where the 2nd characters are respectively U+2018 (LEFT SINGLE QUOTATION MARK) and U+201A (SINGLE LOW-9 QUOTATION MARK):

  • with ICU 72 ucol_strcollUTF8() returns -1

  • with ICU 73 ucol_strcollUTF8() returns +1
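
Because the two quotation marks are easy to confuse visually, the test strings can be written as explicit UTF-8 bytes, which is also the form ucol_strcollUTF8() takes. The following is a minimal sketch that only pins down the inputs; the names STR_LEFT_QUOTE, STR_LOW9_QUOTE and decode_utf8_bmp are made up for this illustration, and the code does not call ICU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two test strings as explicit UTF-8 bytes. Each literal is split in two
 * so that the last hex escape does not swallow the following 'B' or '2'. */
static const char STR_LEFT_QUOTE[] = "A\xE2\x80\x98" "B"; /* A, U+2018, B */
static const char STR_LOW9_QUOTE[] = "A\xE2\x80\x9A" "2"; /* A, U+201A, 2 */

/* Decode the code point starting at s[0]. Handles well-formed 1- to 3-byte
 * UTF-8 sequences only, which is enough for the BMP characters used here. */
static uint32_t decode_utf8_bmp(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    assert(p != NULL);
    if (p[0] < 0x80)
        return p[0];
    if ((p[0] & 0xE0) == 0xC0)
        return ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
    assert((p[0] & 0xF0) == 0xE0);
    return ((uint32_t)(p[0] & 0x0F) << 12)
         | ((uint32_t)(p[1] & 0x3F) << 6)
         | (uint32_t)(p[2] & 0x3F);
}
```

Decoding at offset 1 of each string gives the second character, so the pair passed to the collator can be checked against the code points named above.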

According to git bisect, the behavior has changed at this commit:

commit d86b1cebe192004759b6c875b0f831b97ccdae34

Author: Markus Scherer <markus.icu@gmail.com>

Date: Wed Feb 22 15:14:23 2023 -0800

ICU-22220 update root collation from CLDR 43

It makes sense that a new CLDR version might change the sort order, but are the version numbers of the affected collations not required to change in that case?

In practice the problem is observed with PostgreSQL which uses collation versions to determine whether btree indexes should be rebuilt following an ICU upgrade. Skipping index rebuilding may cause data corruption when upgrading to ICU 73.
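
To illustrate why an unchanged version number is a problem for this kind of consumer, here is a simplified model of a version-based rebuild check. It is a sketch under stated assumptions, not PostgreSQL's or ICU's code: version_to_string and needs_rebuild are hypothetical helpers, and the four-byte layout only imitates the dotted form quoted in this report.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Render a 4-byte version as a dotted string, dropping trailing zero fields
 * but always keeping at least two, e.g. {153,120,0,0} -> "153.120".
 * Standalone helper for illustration; it does not call ICU. */
static void version_to_string(const unsigned char v[4], char *out, size_t out_size)
{
    int count = 4;
    size_t used = 0;
    assert(out != NULL && out_size >= 16); /* room for "255.255.255.255" + NUL */
    while (count > 2 && v[count - 1] == 0)
        --count;
    for (int i = 0; i < count; ++i) {
        if (i > 0)
            out[used++] = '.';
        used += (size_t)snprintf(out + used, out_size - used, "%u", (unsigned)v[i]);
    }
}

/* Hypothetical rebuild check: flag a rebuild only when the version string
 * recorded at index creation differs from the one reported now. */
static int needs_rebuild(const char *recorded, const char *current)
{
    return strcmp(recorded, current) != 0;
}
```

Under this model, an index created with ICU 72 records "153.120"; after an upgrade to ICU 73 the current value is still "153.120", so needs_rebuild() returns 0 even though the sort order of some strings has changed.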

Created October 12, 2023 at 9:14 PM
Updated February 25, 2025 at 9:49 PM

Activity

Anju Kaushik 
January 9, 2025 at 9:53 PM

Hi

ucol_getVersion(): I am seeing the same value for different locales. Is that correct?

For example, with ICU 74.2, the root locale and LZH_KBIG5HAN_AN_CX_EX_FX_HX_NX_S3 show the same collation version.

ICU 74.2: Testing ucol_getUCAVersion() result: 153.121.0.0

If the collation version (the value from ucol_getUCAVersion()) changes between two ICU versions, how can we tell whether a particular locale (collator) is affected? Is there an API or tool that reports the collation version, or the impact of a collation change, at the locale level?
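
Independently of any version number, one way an application could look for an ordering change in a specific collator is to keep a few probe pairs together with the result recorded under the old library, and re-run them after an upgrade. The following is a hypothetical workaround sketch, not an ICU API: compare_fn, struct canary, count_changed and byte_order are names invented here, and the check can only notice changes that affect the chosen probes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Comparator shape: returns <0, 0 or >0. In real use this would wrap a
 * collator call such as ucol_strcollUTF8(); here it is only a parameter. */
typedef int (*compare_fn)(const char *a, const char *b);

/* One probe: a pair of strings and the sign (-1, 0, +1) recorded earlier. */
struct canary {
    const char *a;
    const char *b;
    int recorded_sign;
};

/* Normalize any comparison result to -1, 0 or +1. */
static int sign_of(int x)
{
    return (x > 0) - (x < 0);
}

/* Count the probes whose current ordering differs from the recorded one. */
static size_t count_changed(const struct canary *probes, size_t n, compare_fn cmp)
{
    size_t changed = 0;
    assert(probes != NULL && cmp != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (sign_of(cmp(probes[i].a, probes[i].b)) != probes[i].recorded_sign)
            ++changed;
    }
    return changed;
}

/* Stand-in comparator for demonstration only: plain byte order. */
static int byte_order(const char *a, const char *b)
{
    return strcmp(a, b);
}
```

A nonzero count means at least one probe now sorts differently; a zero count says nothing about strings outside the probe set.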

Markus Scherer 
October 19, 2023 at 5:59 PM

icu4c/source/data/coll/root.txt includes the CLDR version number, but I think that for ""="root"="und" we only load ucadata.icu and don’t bother also loading the root.res file.

Markus Scherer 
October 19, 2023 at 5:55 PM

PS: I think this is the first time (at least for over ten years) that we changed the root sort order without upgrading to a whole new Unicode version.

Markus Scherer 
October 19, 2023 at 5:52 PM

Sorry about that!

Yes, CLDR 43 / ICU 73 changed the sort order, as documented on the ICU 73 release page; look for “collation” there.

I am pretty sure that the Unicode version number feeds into the ucol_getVersion() return value, but this change was basically a cherry-pick from the then-future Unicode 15.1 change – and deliberately so, because the UTC wanted to see if this would cause problems before releasing the change in 15.1.

ICU 74, currently available as a release candidate, of course changes collation again, for Unicode 15.1.

I thought that the CLDR or ICU version number also feeds into the ucol_getVersion() return value; I will check. It might not be included for the root (und) collation if the version number comes from a tailoring resource bundle.