improve on UTrie2

Description

UTrie2 was designed for lookups by code point as well as directly from UTF-16 and UTF-8 text.

It has special data structures for UTF-8 non-shortest forms. These are obsolete and unused since ICU 60 changed its handling of ill-formed UTF-8 to collecting "maximal subparts" of valid sequences. (This is compatible with the W3C Encoding standard, so I do not expect it to change again.)
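To make the "maximal subparts" behavior concrete, here is a minimal sketch of a decoder that follows the well-formed byte ranges from the Unicode Standard and emits one U+FFFD per maximal subpart. It is not ICU code; the function names are made up for illustration.

```python
def second_byte_range(lead):
    """Allowed range for the byte right after `lead` (Unicode well-formed UTF-8 table)."""
    if lead == 0xE0:
        return 0xA0, 0xBF  # excludes non-shortest 3-byte forms
    if lead == 0xED:
        return 0x80, 0x9F  # excludes surrogates
    if lead == 0xF0:
        return 0x90, 0xBF  # excludes non-shortest 4-byte forms
    if lead == 0xF4:
        return 0x80, 0x8F  # excludes code points above U+10FFFF
    return 0x80, 0xBF


def decode_maximal_subparts(data):
    """Decode UTF-8 bytes to code points, one U+FFFD per maximal subpart (hypothetical helper)."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        i += 1
        if b < 0x80:
            out.append(b)
            continue
        if 0xC2 <= b <= 0xDF:
            need, cp = 1, b & 0x1F
        elif 0xE0 <= b <= 0xEF:
            need, cp = 2, b & 0x0F
        elif 0xF0 <= b <= 0xF4:
            need, cp = 3, b & 0x07
        else:
            # C0, C1, F5..FF, or a stray trail byte: a one-byte subpart.
            out.append(0xFFFD)
            continue
        lo, hi = second_byte_range(b)
        ok = True
        for _ in range(need):
            if i < n and lo <= data[i] <= hi:
                cp = (cp << 6) | (data[i] & 0x3F)
                i += 1
                lo, hi = 0x80, 0xBF  # later trail bytes use the plain range
            else:
                ok = False  # stop here; the offending byte is re-examined as a new lead
                break
        out.append(cp if ok else 0xFFFD)
    return out
```

With this approach the non-shortest form `C0 AF` never reaches a lookup as a decoded code point: `decode_maximal_subparts(b"\xC0\xAF")` returns two U+FFFD values, so a trie has no need for dedicated non-shortest-form entries.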

It provides different values for lead surrogate code units vs. code points, which complicates the API, the builder, and the runtime. We might be able to let the callers that need this distinction handle it themselves. (Normalization properties only?)
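One way a caller could handle this itself is sketched below, purely as an illustration. The trie is stood in for by a plain dict keyed by code point, and the caller keeps its own set of lead surrogates that have any supplementary data. All names here are hypothetical, and this is not how ICU normalization is implemented.

```python
def lead_surrogate(cp):
    """Lead surrogate code unit for a supplementary code point."""
    return 0xD800 + ((cp - 0x10000) >> 10)


def leads_with_data(cp_values):
    """Caller-owned side table: lead surrogates under which some code point has a value."""
    return {lead_surrogate(cp) for cp in cp_values if cp >= 0x10000}


def values_from_utf16(units, cp_values, leads, default=0):
    """Look up one value per code point while iterating UTF-16 code units."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        i += 1
        is_pair = (0xD800 <= u <= 0xDBFF and i < len(units)
                   and 0xDC00 <= units[i] <= 0xDFFF)
        if is_pair:
            trail = units[i]
            i += 1
            if u not in leads:
                # Fast path: nothing under this lead, skip the full lookup.
                out.append(default)
                continue
            cp = 0x10000 + ((u - 0xD800) << 10) + (trail - 0xDC00)
            out.append(cp_values.get(cp, default))
        else:
            # BMP code point, or an unpaired surrogate looked up as a code point.
            out.append(cp_values.get(u, default))
    return out
```

In this sketch the map itself only ever stores values for code points. The unpaired surrogate U+D83D can carry its own value, while the code unit 0xD83D in a pair is answered by the caller's side table.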

Size overhead: A UTrie2 with no data at all takes about 4.6 kB. It compacts a large amount of data fairly well, but for collation, where a typical tailoring has only a small amount of data, we might be able to reduce the data size. With less overhead, we might also use more separate tries where we currently combine them or avoid them altogether.
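As a rough illustration of where a fixed overhead can come from, here is a generic two-stage table for the BMP with block deduplication. This is not the actual UTrie2 layout; the 32-entry block size and the 16-bit index and value widths are assumptions chosen for the example.

```python
SHIFT = 5                 # assumed: 32 code points per data block
BLOCK = 1 << SHIFT


def build_two_stage(values, limit=0x10000, default=0):
    """Build (index, data) for code points below `limit`, sharing identical blocks."""
    seen = {}             # block contents -> offset into data
    index, data = [], []
    for start in range(0, limit, BLOCK):
        block = tuple(values.get(cp, default) for cp in range(start, start + BLOCK))
        offset = seen.get(block)
        if offset is None:
            offset = len(data)
            seen[block] = offset
            data.extend(block)
        index.append(offset)
    return index, data


def lookup(index, data, cp):
    """Two-stage lookup: index entry by high bits, then offset by low bits."""
    return data[index[cp >> SHIFT] + (cp & (BLOCK - 1))]


def size_bytes(index, data, index_width=2, value_width=2):
    """Storage size under the assumed 16-bit index and value widths."""
    return len(index) * index_width + len(data) * value_width
```

Under these assumptions, `size_bytes(*build_two_stage({}))` is 4160 bytes: the 2048-entry index costs 4096 bytes even though every entry points at the same single default block. A design where the index itself can shrink when little data is present would lower that floor.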

I have some ideas: http://site.icu-project.org/design/struct/utrie

I will experiment and benchmark in a branch.

Status:
Assignee: Markus Scherer
Reporter: Markus Scherer
Labels: None
Reviewer: Andy Heninger
Time Needed: Weeks
Start date: None
Components:
Fix versions:
Priority: medium