Collation folding

Description

The Unicode Collation Algorithm spec has a section on Collation Folding, describing how to map one string (such as "résumé") to another string (such as "resume") for searching text that is equivalent at a certain strength level.
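To make the "résumé" to "resume" example concrete, here is a minimal Python sketch. It is not the UCA collation folding described in the spec: it only approximates primary-strength folding by decomposing the text, dropping combining marks, and case-folding. The function name `approximate_primary_fold` is made up for illustration.

```python
import unicodedata


def approximate_primary_fold(text: str) -> str:
    """Rough stand-in for primary-strength folding (NOT real UCA folding)."""
    # Decompose precomposed letters, e.g. U+00E9 -> "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop characters with a non-zero canonical combining class (accents etc.).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove case distinctions as well.
    return stripped.casefold()


print(approximate_primary_fold("R\u00e9sum\u00e9"))  # prints "resume"
```

Real collation folding would be driven by collation data (including tailorings and contractions), which this approximation ignores.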

This proposal is to add an ICU API that performs collation folding or, at a minimum, an API that provides access to the CLDR data needed to implement collation folding.

(Or please let me know if such APIs already exist; I asked on the mailing list but didn't get a reply.)

Attachments

5

Activity

David Matson October 5, 2023 at 5:11 PM

David Matson October 5, 2023 at 5:03 PM

Notes from the TC meeting:

  1. Collation Folding API Review

  2. Collation Folding Data Tables Review

    1. Early work building binary data: adds approximately 750 kB of data for all of the search tailorings

    2. There are probably ways to remove redundant data / make it smaller

  3. At runtime, consider

    1. mapping characters to characters – might make it easy to remove data for characters which don’t fold

    2. vs. CEs to characters – and leave it to a runtime Collator to handle discontiguous contractions etc.

  4. Where should the builder code go? – Can probably be added to the regular build, not necessarily the Unicode-file-specific build.
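As a rough illustration of the character-to-character option in item 3 of the notes, the following Python sketch prunes entries for characters that fold to themselves and treats a missing entry as "does not fold". The table contents and the names `prune_identity` and `fold` are hypothetical; this is not the data format or runtime discussed in the meeting.

```python
def prune_identity(table: dict[str, str]) -> dict[str, str]:
    """Drop entries whose folded form equals the source character."""
    return {src: dst for src, dst in table.items() if src != dst}


def fold(text: str, table: dict[str, str]) -> str:
    """Fold one character at a time; characters without an entry pass through."""
    return "".join(table.get(ch, ch) for ch in text)


# Hypothetical toy table, not real collation folding data.
full_table = {
    "r": "r", "s": "s", "u": "u", "m": "m", "e": "e",
    "R": "r", "\u00e9": "e", "\u00c9": "e",
}
compact_table = prune_identity(full_table)

# Both tables give the same result; the compact one stores fewer entries.
print(fold("R\u00e9sum\u00e9", compact_table))  # prints "resume"
```

A per-character lookup like this cannot handle contractions, which is one reason the notes list the CE-based approach as an alternative.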

Daniel Ju October 2, 2023 at 7:51 PM

We’ve generated preliminary collation folding *.txt data files using the tool mentioned below that created the collation folding maps. The data files are generated with the ICU collation C APIs.

I’ve attached the “collation-folding-noninvariant-keys.zip” file that contains a readable form (see 2nd bullet below) of the data files.

Some notes on the files:

  • The data is generated from ICU 72.1.

  • The keys contain non-invariant characters for readability, but they will need to be reformatted before resource bundle data files can be generated from them. A workaround would be to use space-separated Unicode code point values as the key.

  • There is only data for locales that have a non-fallback “search” collation type, including root.

  • The data files include all primary, secondary, and tertiary mappings for that locale.

  • The data files only include incremental/additive data mappings on top of root.

  • Aliased locales currently have duplicated data files (he <-> iw, no <-> no_NO, sh <-> sr_Latn).
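Two of the notes above, the space-separated code point keys and the incremental data on top of root, can be sketched in Python as follows. The exact key format (uppercase hex, at least four digits), the helper names, and the sample mappings are all assumptions for illustration; they are not taken from the attached files.

```python
from collections import ChainMap


def encode_key(source: str) -> str:
    """Encode a key as space-separated hex code points (ASCII only)."""
    return " ".join(f"{ord(ch):04X}" for ch in source)


def decode_key(key: str) -> str:
    """Inverse of encode_key."""
    return "".join(chr(int(cp, 16)) for cp in key.split())


# Hypothetical root mappings (invented, not real CLDR data).
root = {encode_key("\u00e9"): "e", encode_key("\u00e5"): "a"}
# Hypothetical locale file holding only the entries that differ from root.
locale_delta = {encode_key("\u00e5"): "\u00e5"}

# Look up in the locale data first, then fall back to root.
effective = ChainMap(locale_delta, root)

print(encode_key("\u00e9"))  # prints "00E9"
```

With this layering, a locale file only needs to carry the entries that override or extend root.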

Jeff Genovy September 22, 2023 at 10:57 PM

Attaching the files from David to this ticket: 3 text files, and the same files in one compressed “.zip” file. (Note: They were generated from an older version of ICU, 68.2.)

David Matson September 22, 2023 at 8:11 PM
Edited

We’ve put together a tool that creates collation folding maps, and we’d like to get folks to review the output to see if it looks right.

We have data for the root (search) collation, with primary, secondary, and tertiary strengths, which Jeff kindly helped attach to this ticket.

A couple of notes on the files:

  • The data is generated from an older version of ICU (whatever ships with Windows at the moment; ICU 68.2).

  • Default collation elements are not being folded to the empty string inside the table itself; instead, there’s a U+0000 marker, which could be replaced with an empty string or perhaps CGJ at runtime. The original hope was that running text through collation folding wouldn’t change the sort key generated for the text, so preserving text corresponding to default collation elements may be needed for that case (though perhaps not sufficient).
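The runtime handling of the U+0000 marker mentioned in the last note could be sketched like this in Python. The function name and the `keep_placeholder` flag are hypothetical; the two behaviours (empty string or CGJ, U+034F) are the ones suggested above.

```python
MARKER = "\u0000"  # marker stored in the table, per the note above
CGJ = "\u034f"     # COMBINING GRAPHEME JOINER


def resolve_markers(folded: str, keep_placeholder: bool = False) -> str:
    """Replace each U+0000 marker with CGJ, or drop it entirely."""
    return folded.replace(MARKER, CGJ if keep_placeholder else "")


# Dropping the marker shortens the text; keeping CGJ preserves a placeholder.
print(len(resolve_markers("a\u0000b")))        # prints 2
print(len(resolve_markers("a\u0000b", True)))  # prints 3
```

Whether either choice keeps the sort key of the folded text unchanged is an open question in the note, and this sketch does not address it.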

Details

Time Needed

Weeks

Created June 22, 2023 at 7:39 PM
Updated October 5, 2023 at 5:11 PM