improve on UTrie2

Description

UTrie2 was designed for lookups by code points, UTF-16, as well as UTF-8.

It has some special data structures for UTF-8 non-shortest forms. These have been obsolete and unused since ICU 60 changed the handling of ill-formed UTF-8 to collecting "maximal subparts" of valid sequences. (This is compatible with the W3C Encoding standard, so I expect it to not change again.)
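
For illustration, here is a minimal sketch of the "maximal subparts" behavior mentioned above. It is not ICU's implementation; `decodeUtf8` is a hypothetical helper written only for this ticket, following the well-formed byte ranges from the Unicode Standard. Each maximal subpart of an ill-formed sequence becomes one U+FFFD, and the byte that ended the subpart is re-examined as a possible lead byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not ICU API): decodes UTF-8 and replaces each
// maximal subpart of an ill-formed sequence with one U+FFFD.
std::vector<char32_t> decodeUtf8(const std::string &s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        uint8_t b = static_cast<uint8_t>(s[i++]);
        if (b < 0x80) { out.push_back(b); continue; }  // ASCII
        int trail = 0;               // number of trail bytes still needed
        char32_t c = 0;              // code point being assembled
        uint8_t lo = 0x80, hi = 0xBF;  // allowed range for the next trail byte
        if (b >= 0xC2 && b <= 0xDF) {
            trail = 1; c = b & 0x1F;
        } else if (b >= 0xE0 && b <= 0xEF) {
            trail = 2; c = b & 0x0F;
            if (b == 0xE0) lo = 0xA0;  // excludes non-shortest forms
            if (b == 0xED) hi = 0x9F;  // excludes surrogates
        } else if (b >= 0xF0 && b <= 0xF4) {
            trail = 3; c = b & 0x07;
            if (b == 0xF0) lo = 0x90;  // excludes non-shortest forms
            if (b == 0xF4) hi = 0x8F;  // excludes values above U+10FFFF
        } else {
            out.push_back(0xFFFD);     // C0, C1, F5..FF, or a stray trail byte
            continue;
        }
        bool ok = true;
        for (; trail > 0; --trail) {
            if (i >= s.size()) { ok = false; break; }
            uint8_t t = static_cast<uint8_t>(s[i]);
            if (t < lo || t > hi) { ok = false; break; }  // t is not consumed
            c = (c << 6) | (t & 0x3F);
            ++i;
            lo = 0x80; hi = 0xBF;      // only the first trail byte is restricted
        }
        out.push_back(ok ? c : 0xFFFD);
    }
    return out;
}
```

With this sketch, the non-shortest form `C0 80` yields two U+FFFD because neither byte can start or continue a valid sequence, while the truncated sequence `F0 90 80` followed by `A` yields one U+FFFD and then `A`. Since non-shortest forms never decode to a code point this way, a trie no longer needs values for them.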

It provides for different values for lead surrogate code //units// vs. code //points//, which complicates the API, builder and runtime. We might be able to let those callers that need this handle it themselves. (Normalization properties only?)
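
One way callers could handle this themselves, sketched under assumptions: the trie stores values by code point only, and a caller that needs distinct values for lead surrogate code units keeps its own 1024-entry side table. All names here (`LeadSurrogateAwareLookup`, `byCodePoint`, `leadUnitValues`) are hypothetical and not part of any ICU API.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical caller-side wrapper; not ICU API.
struct LeadSurrogateAwareLookup {
    // Stand-in for a trie lookup that only knows about code points.
    std::function<uint32_t(char32_t)> byCodePoint;
    // Caller-owned values for the 1024 lead surrogate code units D800..DBFF.
    std::array<uint32_t, 0x400> leadUnitValues{};

    // Value for a code point, including an unpaired surrogate code point.
    uint32_t getForCodePoint(char32_t c) const { return byCodePoint(c); }

    // Value for a lead surrogate code unit seen in UTF-16 text.
    // Precondition: 0xD800 <= unit <= 0xDBFF.
    uint32_t getForLeadUnit(char16_t unit) const {
        return leadUnitValues[unit - 0xD800];
    }
};
```

The side table costs 4kB with 32-bit values, so this would only make sense for the few callers that need the distinction.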

Size overhead: A UTrie2 with no data at all takes about 4.6kB. It compacts a large amount of data fairly well, but for collation, where a typical tailoring has a small amount of data, we might be able to reduce the data size. With less overhead, we might also use more separate tries where we currently combine tries or avoid them altogether.

I have some ideas: http://site.icu-project.org/design/struct/utrie

I will experiment and benchmark in a branch.

Activity

TracBot
June 30, 2018, 11:58 PM
Trac Comment 1 by —2017-12-29T21:52:58.722Z

ICU4C “make” currently builds 237 UTrie2s (BreakIterator & dictionaries, Collator, confusables?). Total size of these: 9989224 bytes = 9.53MB.

ICU also has 9 UTrie2s in prebuilt data files: one each in 4 *.nrm files, 2 tries in uprops, and one each in ubidi, ucase, and ucadata.

TracBot
June 30, 2018, 11:58 PM
Trac Comment 3 by —2018-01-09T21:56:00.562Z

I had miscounted the number of tries by simply printing something every time a trie was built.

Whenever the genrb tool builds its first collation tailoring with rules (not just settings), the normalization code builds a 79.4kB trie with data for the CanonicalIterator. This happens 96 times during a "make" build because genrb is called separately for each resource bundle. Any single runtime process builds at most one of these, and often none.

Ignoring those, "make" builds only 141 tries that are stored in the data, with a total size of 2180968 bytes = 2.08MB.
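
As a consistency check on the numbers in the two comments above, the difference between the two counts can be worked out directly. This is only arithmetic on the figures already quoted; the constant names are made up for this sketch, and kB is taken to mean 1024 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Figures quoted in the comments above.
constexpr int64_t kAllTries = 237, kAllBytes = 9989224;        // first count
constexpr int64_t kStoredTries = 141, kStoredBytes = 2180968;  // corrected count

// Tries that were built but not stored in the data.
constexpr int64_t kTempTries = kAllTries - kStoredTries;       // 96
constexpr int64_t kTempBytes = kAllBytes - kStoredBytes;       // 7808256

// Size of each of those tries, and the average size of a stored trie.
constexpr int64_t kBytesPerTempTrie = kTempBytes / kTempTries;       // 81336
constexpr int64_t kAvgStoredBytes = kStoredBytes / kStoredTries;     // 15467 (rounded down)

static_assert(kTempTries == 96, "matches the 96 genrb invocations");
static_assert(kTempBytes % kTempTries == 0, "divides evenly");
static_assert(kBytesPerTempTrie == 81336, "about 79.4kB at 1024 bytes per kB");
```

The 96 extra tries account for exactly 81336 bytes each, which is the 79.4kB CanonicalIterator trie. The 141 stored tries average roughly 15kB, so the 4.6kB fixed overhead is a sizable fraction of a typical stored trie.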

Markus Scherer
August 28, 2018, 8:14 PM

For background on the changes in pull request 83 see ticket ICU-20097.

Fixed

Assignee

Markus Scherer

Reporter

Markus Scherer

Components

Labels

None

Reviewer

Andy Heninger

Priority

medium

Time Needed

Weeks

Fix versions
