UTrie2 was designed for lookups by code point as well as directly from UTF-16 and UTF-8 text.
It has some special data structures for UTF-8 non-shortest forms which are obsolete and unused since ICU 60 changed handling of ill-formed UTF-8 to collecting "maximal subparts" of valid sequences. (This is compatible with the W3C Encoding standard, so I expect it to not change again.)
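To illustrate why the non-shortest-form structures are no longer needed, here is a minimal sketch of "maximal subparts" decoding as described in the Unicode Standard. This is not ICU code; the names `next_code_point` and `decode_all` are made up for this example. The point is that a non-shortest form such as C0 80 or E0 80 80 is never consumed as one sequence: each byte that cannot continue a valid sequence becomes its own U+FFFD, so there is nothing for a trie to look up for such forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFD

/* Hypothetical helper, not an ICU API.
 * Decodes one code point from s[*pi..length) and advances *pi.
 * Returns U+FFFD once for each maximal subpart of an ill-formed sequence. */
static int32_t next_code_point(const uint8_t *s, size_t length, size_t *pi) {
    assert(s != NULL && pi != NULL && *pi < length);
    size_t i = *pi;
    uint8_t lead = s[i++];
    int32_t c;
    int trail_count;
    uint8_t lo = 0x80, hi = 0xBF;  /* allowed range for the first trail byte */

    if (lead <= 0x7F) {
        *pi = i;
        return lead;
    } else if (lead >= 0xC2 && lead <= 0xDF) {
        trail_count = 1;
        c = lead & 0x1F;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
        trail_count = 2;
        c = lead & 0x0F;
        if (lead == 0xE0) lo = 0xA0;       /* rejects non-shortest forms */
        else if (lead == 0xED) hi = 0x9F;  /* rejects surrogates */
    } else if (lead >= 0xF0 && lead <= 0xF4) {
        trail_count = 3;
        c = lead & 0x07;
        if (lead == 0xF0) lo = 0x90;       /* rejects non-shortest forms */
        else if (lead == 0xF4) hi = 0x8F;  /* rejects values above U+10FFFF */
    } else {
        /* C0, C1, F5..FF, or a stray trail byte: a one-byte error. */
        *pi = i;
        return REPLACEMENT;
    }
    for (int k = 0; k < trail_count; ++k) {
        if (i == length || s[i] < lo || s[i] > hi) {
            *pi = i;  /* the offending byte is not consumed */
            return REPLACEMENT;
        }
        c = (c << 6) | (s[i++] & 0x3F);
        lo = 0x80;  /* later trail bytes use the full range */
        hi = 0xBF;
    }
    *pi = i;
    return c;
}

/* Decodes the whole buffer into out[] and returns the number of code points. */
static size_t decode_all(const uint8_t *s, size_t length,
                         int32_t *out, size_t capacity) {
    size_t i = 0, n = 0;
    while (i < length && n < capacity) {
        out[n++] = next_code_point(s, length, &i);
    }
    return n;
}
```

With this sketch, the bytes E0 80 80 decode to three U+FFFD, while F0 90 41 decodes to one U+FFFD (for the valid prefix F0 90) followed by U+0041.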
It provides for different values for lead surrogate code //units// vs. code //points//, which complicates the API, builder and runtime. We might be able to let those callers that need this handle it themselves. (Normalization properties only?)
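As a rough sketch of what "handle it themselves" could look like: the trie stores exactly one value per code point, and a caller that wants per-lead-surrogate information keeps its own 1024-entry side table. Everything here is hypothetical (`toy_trie_get` stands in for a real trie, and `LeadUnitTable` and `get_from_utf16` are names invented for this example); I am also assuming the use case is a fast path that skips supplementary code points whose lead surrogate has no non-default data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEAD_START 0xD800
#define LEAD_COUNT 0x400
#define TRAIL_START 0xDC00
#define DEFAULT_VALUE 0

/* Hypothetical stand-in for a trie with exactly one value per code point. */
static uint16_t toy_trie_get(int32_t c) {
    if (c >= 0x10000 && c <= 0x103FF) return 7;  /* pretend data */
    return DEFAULT_VALUE;  /* everything else, including surrogate code points */
}

/* Caller-owned side table: one flag per lead surrogate code unit.
 * A flag is set if any of the 1024 supplementary code points
 * that start with this lead surrogate has a non-default value. */
typedef struct {
    uint8_t lead_has_data[LEAD_COUNT];
} LeadUnitTable;

static void lead_table_build(LeadUnitTable *t) {
    assert(t != NULL);
    for (int32_t lead = 0; lead < LEAD_COUNT; ++lead) {
        int32_t start = 0x10000 + (lead << 10);
        t->lead_has_data[lead] = 0;
        for (int32_t j = 0; j < 0x400; ++j) {
            if (toy_trie_get(start + j) != DEFAULT_VALUE) {
                t->lead_has_data[lead] = 1;
                break;
            }
        }
    }
}

static int is_lead(uint16_t u) { return (u & 0xFC00) == LEAD_START; }
static int is_trail(uint16_t u) { return (u & 0xFC00) == TRAIL_START; }

/* Returns the value for the code point at s[*pi] and advances *pi.
 * A surrogate pair whose lead has no data is skipped without a trie lookup. */
static uint16_t get_from_utf16(const LeadUnitTable *t, const uint16_t *s,
                               size_t length, size_t *pi) {
    assert(t != NULL && s != NULL && pi != NULL && *pi < length);
    size_t i = *pi;
    uint16_t unit = s[i++];
    if (is_lead(unit) && i < length && is_trail(s[i])) {
        uint16_t trail = s[i++];
        *pi = i;
        if (!t->lead_has_data[unit - LEAD_START]) {
            return DEFAULT_VALUE;  /* fast path: whole block is default */
        }
        int32_t c = 0x10000 + ((int32_t)(unit - LEAD_START) << 10)
                    + (trail - TRAIL_START);
        return toy_trie_get(c);
    }
    /* BMP code unit or unpaired surrogate: look it up as a code point. */
    *pi = i;
    return toy_trie_get(unit);
}
```

The trade-off in this sketch is 1024 extra bytes owned by the caller, in exchange for a trie API, builder and runtime that only ever deal with code point values.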
Size overhead: A UTrie2 with no data at all takes about 4.6kB. It compacts a large amount of data fairly well, but for collation, where a typical tailoring has only a small amount of data, we might be able to reduce the data size. With less overhead, we might also use more separate tries in places where we currently combine tries or avoid them altogether.
I have some ideas: http://site.icu-project.org/design/struct/utrie
I will experiment and benchmark in a branch.
ICU4C “make” currently builds 237 UTrie2s (BreakIterator & dictionaries, Collator, confusables?). Total size of these: 9989224 bytes = 9.53MB.
ICU also has 9 UTrie2s in prebuilt data files: one each in 4 *.nrm files, 2 tries in uprops, and one each in ubidi, ucase, and ucadata.
I had miscounted the number of tries by simply printing something every time a trie was built.
Whenever the genrb tool builds its first collation tailoring with rules (not just settings), the normalization code builds a 79.4kB trie with data for the CanonicalIterator. This happens 96 times during a "make" build because genrb is called separately for each resource bundle. Any single runtime process builds at most one of these, and often none.
Ignoring those, "make" builds only 141 tries that are stored in the data, with a total size of 2180968 bytes = 2.08MB.
For background on the changes in pull request 83 see ticket ICU-20097.