RFE: add to UCA table the mapping offsets for implicit primaries

Description

Collation computes "implicit" primary weights on the fly for all code points
that are not mentioned in the UCA+tailoring tables. As part of the computation,
the code point is remapped so that Han characters are ordered before unassigned
code points, Unihan before extension A, etc. This is hardcoded in several places:
the FractionalUCA generator and the C and J library code.

It would be more robust, both for maintenance and for future Han characters that
are inserted before existing ones, to include ranges of code points with their
mapping offsets in FractionalUCA.txt, and to use them for the implicit-weight
computation instead of the hardcoded ranges and offsets.

The data could be added to FractionalUCA.txt when other additions are made for
more data-driven operation.

genuca could either add another data structure with a table of { range start,
range end, signed offset }, or fill the data into the main UCA collation trie,
as new value bits of the special CEs for implicits. The former would take only a
few bytes but require a search at runtime, while the latter would use the
available runtime data's bits at a cost of a couple of kB.
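
A minimal sketch of the first option (a separate table searched at runtime), to
make the trade-off concrete. Everything here is hypothetical: the names
ImplicitRange, kImplicitRanges, kDefaultOffset and remapForImplicit, the offset
values, and the choice of a binary search are illustrative assumptions, not the
existing genuca or library code. The two ranges are the main Unihan block and
extension A; the offsets are chosen only so that Unihan sorts before extension A
and both sort before all other code points.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

// Hypothetical table entry: an inclusive code point range and the signed
// offset added to each code point in it before computing the implicit weight.
struct ImplicitRange {
    int32_t start;
    int32_t end;
    int32_t offset;
};

// Illustrative data only, sorted by range start. In the proposal, these
// values would come from FractionalUCA.txt rather than being hardcoded.
static const ImplicitRange kImplicitRanges[] = {
    {0x3400, 0x4DBF, 0x1E00},   // extension A -> 0x5200..0x6BBF
    {0x4E00, 0x9FFF, -0x4E00},  // Unihan      -> 0x0000..0x51FF
};

// Assumed offset for code points outside every listed range; it places them
// after all remapped Han characters.
static const int32_t kDefaultOffset = 0x100000;

// Remaps a code point via binary search over the range table.
int32_t remapForImplicit(int32_t c) {
    const ImplicitRange *limit = std::end(kImplicitRanges);
    // Find the first range whose end is not below c.
    const ImplicitRange *r = std::lower_bound(
        std::begin(kImplicitRanges), limit, c,
        [](const ImplicitRange &range, int32_t cp) { return range.end < cp; });
    if (r != limit && r->start <= c) {
        return c + r->offset;  // c lies inside this range
    }
    return c + kDefaultOffset;  // c is not covered by the table
}
```

With this table, remapForImplicit(0x4E00) yields 0 and remapForImplicit(0x3400)
yields 0x5200, so Unihan precedes extension A. Each entry is three 32-bit
values, which matches the "few bytes plus a runtime search" side of the
trade-off described above.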

Status

Assignee

Markus Scherer

Reporter

TracBot

Labels

Reviewer

None

Time Needed

Days

Start date

None

Components

Fix versions

Priority

medium