smaller/simpler FCD data

Description

(FCD is the Fast C or D data we use for some text processing, especially in collation when the normalization check is on.)

FCD data used to be in the one unorm.icu, then hardcoded in the common library. We had it only for the Unicode NFC/NFD data.

A couple of years ago, I redesigned the normalization code & data. We now support more normalization forms, and support all operations with all data. I omitted the FCD data from the files to keep them small. FCD data is lazily initialized, for example when the first collator is opened. This adds some 10kB to each ICU-collating process' heap.

Idea:

  • Use smaller, simpler data for fast FCD lookups for the most common characters.

  • Fall back to a slightly slower lookup in the regular, existing normalization data when the fast&simple data does not yield a conclusive value.

  • Make the data small enough so that we can just include it in the data file and remove the lazy-init of the larger FCD UTrie2; this will remove one synchronization point from the library, and the 10kB of per-collation-process heap memory.

More concretely:

  • Store a flat array of tccc byte values for the first N code points where U+00A0<=N<=U+0300.

  • We know (or enforce in gennorm2) that below U+00A0 lccc=tccc=0 and below U+0300 lccc=0, so the tccc lookup alone is sufficient.

  • With N>=U+0100, we cover the performance-sensitive Latin-1 range.

  • It looks like reasonable values of N are (with diminishing returns)

  • U+0100 (256) to include Latin-1

  • U+0180 (384) to include Latin Extended-A (French oe-ligature, Eastern Europe, ...)

  • Store another array of bytes, indexed by the top bits of BMP code points.

  • If a byte is non-zero, then at least one of that group of code points has a non-zero FCD value and is to be looked up in the normal data.

  • Each bit in that byte represents an eighth of the group.

  • The main goal is to cover CJK with zero-bytes and quickly determine that they are FCD-inert.

  • It would be nice to have enough bits so that the core Cyrillic letters could get FCD=0 determined quickly. However, Cyrillic letter short i (U+0419 & U+0439) is in the way.

  • It would be nice to keep the FCD data a small percentage of the total normalization data.

  • For lead surrogates, we could set bits if the corresponding ranges of supplementary code points have any non-zero FCD values.

  • If we store an array of, say, 256 bytes, then each byte is for a group of 256 code points, each bit for 32.

The lookup would be something like

For Latin-1 and CJK, this should be faster than the current UTrie2 lookup. It would add 384+256=640 bytes to the normalization data, which is about a 2% increase for nfc.nrm, the smallest standard .nrm file.

Activity

Show:
TracBot
July 1, 2018, 12:19 AM
Trac Comment 1 by —2011-11-28T14:41:08.822Z

The smallFCD[] bit set was added in ticket #8804 (C++ r30980 & Java r30981) as part of the changes for .nrm formatVersion 2.0. The tccc180[] array is computed at load time, using smallFCD[] to accelerate.

TracBot
July 1, 2018, 12:19 AM
Trac Comment 5 by —2016-10-05T23:02:48.686Z

Milestone 49.0.2 (m2) deleted

Fixed

Assignee

Markus Scherer

Reporter

Markus Scherer

Components

Labels

None

Reviewer

None

Priority

minor

Time Needed

Hours

Fix versions