
RFE: more compact CJK collation data

Description

Collation data for CJK are among the largest ICU data pieces.

This is largely because CJK unified ideographs are reordered according to some
permutation (some charset order, stroke order, phonetic order, ...).

If there are more than about 11000 primary differences after a "&[top]", then
those primary weights become 3 bytes long. Each such weight requires an
expansion table entry, i.e., 12 bytes per collation entry, for many (though not
all) of the reordered ideographs.
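
To put the 12-byte figure in perspective, here is a rough upper-bound calculation. It assumes, purely for illustration, that every code point in the 4e00..9fa5 range ends up with an expansion entry (the description says "many", not all), and the function name is made up for this sketch.

```python
# Rough upper bound on the expansion cost for one block of ideographs.
# Assumption: every code point in the range gets a 3-byte primary weight
# and therefore a 12-byte collation entry (the real share is "many, not all").
BYTES_PER_EXPANDED_ENTRY = 12  # figure quoted in the description


def expansion_cost(first, last, bytes_per_entry=BYTES_PER_EXPANDED_ENTRY):
    """Return (number of code points, total bytes) for the inclusive range."""
    count = last - first + 1
    return count, count * bytes_per_entry


count, total = expansion_cost(0x4E00, 0x9FA5)
print(count, total)  # 20902 250824
```

So a single tailoring that reorders this whole block could spend on the order of 250 KB on these entries alone.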

These collation tables are expected to grow with the addition of many more
ideographs in Unicode 3.1 and later.

Ideas for reducing the sizes of these tables, and thus also increasing the
performance of collation with them:

1. For Pinyin, we might be able to have primary differences only per Pinyin
syllable, and secondary differences within. -> 4B/entry
(Rule change, no code change, probably applies only to zh__PINYIN.)

2. We could add a special tag that lets us store 3-byte primary weights
with implicitly "ignorable" secondary and tertiary/case values. This would avoid
the expansions for these collation entries. -> 4B/entry
(Fairly simple code change.)

3. We could add a different special tag that is used for a whole range of code
points (e.g. 4e00..9fa5) and points to a simple table with 3B per code point in
that range. Some 3-byte combination would mean "no entry". The .txt rules would
need to specify a permutation resulting in only primary weights as in 2.
This would store 3B/entry but possibly some unassigned entries ("holes" in the
range).
(More complicated code change, new data structures in collation data.)
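
The linear table from idea 3 could look roughly like the sketch below. This is not ICU code: the class name, the ff ff ff "no entry" sentinel, and the sample weight are all assumptions chosen for illustration.

```python
# Sketch of idea 3: one linear table of 3-byte primary weights per range.
# The sentinel value and all names here are hypothetical, not ICU's.
NO_ENTRY = b"\xff\xff\xff"  # assumed 3-byte combination meaning "no entry"


class RangeWeightTable:
    """3 bytes per code point for one contiguous code point range."""

    def __init__(self, first, last):
        self.first = first
        self.last = last
        # Every slot starts out as a hole.
        self.data = bytearray(NO_ENTRY * (last - first + 1))

    def set_primary(self, code_point, primary):
        if not self.first <= code_point <= self.last:
            raise ValueError("code point outside the table range")
        weight = primary.to_bytes(3, "big")
        if weight == NO_ENTRY:
            raise ValueError("weight collides with the no-entry sentinel")
        offset = 3 * (code_point - self.first)
        self.data[offset:offset + 3] = weight

    def primary(self, code_point):
        """Return the primary weight as an int, or None if there is none."""
        if not self.first <= code_point <= self.last:
            return None
        offset = 3 * (code_point - self.first)
        weight = bytes(self.data[offset:offset + 3])
        return None if weight == NO_ENTRY else int.from_bytes(weight, "big")


table = RangeWeightTable(0x4E00, 0x9FA5)
table.set_primary(0x4E00, 0x8A1234)  # made-up weight
```

For 4e00..9fa5 this table takes 3 * 20902 = 62706 bytes, holes included, compared with 20902 * 12 = 250824 bytes if every code point needed a 12-byte entry. Secondary and tertiary weights are implicit, as in idea 2, so the lookup only has to return the primary.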

Note that 3. would also work around another limitation of the collation data
structure: the main trie table is designed to hold only 64k entries. If one
permutes >70000 ideographs, plus other collation entries, then this will not
fit in the current structure.

Option 3. would keep most of these entries in a few separate, linear tables.
Otherwise, if such large permutations are going to be used, we would need to
redesign the trie table structure, for example by shifting each trie index
table value left by one bit to allow for 128k entries in the trie data table.
(This would somewhat reduce how well the trie can be compacted.)
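
The effect of the shift can be sketched as follows, assuming 16-bit trie index values (which is what the 64k limit suggests). The function names are made up. The sketch also shows one likely reason compaction suffers: with a shift of one, a data block can only start at an even offset, which leaves fewer ways to overlap blocks.

```python
# Sketch of the shifted-index idea; assumes 16-bit trie index values.
INDEX_BITS = 16


def addressable_entries(shift):
    """Data-table entries reachable when index values are shifted left."""
    return (1 << INDEX_BITS) << shift


def encode_block_start(offset, shift):
    """Store a data block offset in an index value; it must be aligned."""
    if offset % (1 << shift):
        raise ValueError("block start must be a multiple of %d" % (1 << shift))
    value = offset >> shift
    if value >= 1 << INDEX_BITS:
        raise ValueError("offset does not fit into an index value")
    return value


def decode_block_start(value, shift):
    """Recover the data block offset from a stored index value."""
    return value << shift


print(addressable_entries(0), addressable_entries(1))  # 65536 131072
```

With a shift of 0, an offset such as 100000 cannot be stored at all. With a shift of 1 it can, but an odd offset such as 100001 is rejected.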

Environment

Status

Assignee

weivsara@gmail.com

Reporter

TracBot

Labels

tracCreated

Nov 02, 2001, 2:05 AM

tracOwner

weiv

tracReporter

markus.scherer@a95c9666650cfc8d

tracResolution

wontfix

tracReviewer

markus

tracStatus

closed

Components

Fix versions

Priority

major