Details

    • Type: Enhancement
    • Status: Done
    • Priority: medium
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 63.1
    • Component/s: properties
    • Labels: None
    • Time Needed: Weeks
    • tracCc: andy,jungshik
    • tracOwner: markus
    • tracProject: all
    • tracReporter: markus
    • tracStatus: accepted
    • tracWeeks: 2

      Description

      UTrie2 was designed for lookups by code point as well as from UTF-16 and UTF-8 text.

      It has some special data structures for UTF-8 non-shortest forms. These have been obsolete and unused since ICU 60, which changed the handling of ill-formed UTF-8 to collecting "maximal subparts" of valid sequences. (This is compatible with the W3C Encoding standard, so I do not expect it to change again.)
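
      To illustrate what "maximal subparts" means in practice, here is a small sketch. It is not ICU code: it uses Python's built-in UTF-8 decoder, on the assumption that CPython replaces each ill-formed maximal subpart with a single U+FFFD (the helper name `replace_ill_formed` is made up for this example).

```python
# Illustration only: uses Python's built-in UTF-8 decoder, not ICU.
# Assumption: the decoder emits one U+FFFD per ill-formed maximal subpart.


def replace_ill_formed(data: bytes) -> str:
    """Decode UTF-8, replacing ill-formed byte sequences with U+FFFD."""
    return data.decode("utf-8", errors="replace")


# E2 82 is a valid prefix of a 3-byte sequence (e.g. E2 82 AC) that is cut
# short by 'A', so the two bytes together form one maximal subpart.
truncated = replace_ill_formed(b"\xe2\x82A")

# C0 80 is a non-shortest form of U+0000. C0 can never start a well-formed
# sequence, so each byte is an error on its own and no special decoding
# of the pair is needed.
non_shortest = replace_ill_formed(b"\xc0\x80")

print([hex(ord(ch)) for ch in truncated])
print([hex(ord(ch)) for ch in non_shortest])
```

      With this handling, a non-shortest form is never decoded as a unit, which is why dedicated data for such sequences has no remaining use.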

      It provides for different values for lead surrogate code units vs. code points, which complicates the API, the builder, and the runtime code. We might be able to let the callers that need this distinction handle it themselves. (Normalization properties only?)
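
      The code unit vs. code point distinction can be sketched as follows. This is a hypothetical illustration, not the UTrie2 API: the two dictionaries, their values, and the function names are invented here. It shows one way a caller could keep its own side table for lead surrogate code units while the main lookup stores only code point values.

```python
# Hypothetical sketch; none of these names exist in ICU.

def is_lead(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    return 0xDC00 <= unit <= 0xDFFF


def code_points(units):
    """Yield code points from UTF-16 code units; unpaired surrogates pass through."""
    i = 0
    while i < len(units):
        u = units[i]
        i += 1
        if is_lead(u) and i < len(units) and is_trail(units[i]):
            u = 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        yield u


# Stand-in for the trie: values per code point (made-up property values).
CODE_POINT_VALUES = {0xD83D: "surrogate", 0x1F600: "emoji"}
# Caller-owned side table: values per lead surrogate code unit.
LEAD_UNIT_VALUES = {0xD83D: "supplementary-data-follows"}


def value_for_code_point(c):
    return CODE_POINT_VALUES.get(c, "default")


def value_for_code_unit(u):
    """A caller that needs per-unit values consults its own table first."""
    if is_lead(u):
        return LEAD_UNIT_VALUES.get(u, "default")
    return value_for_code_point(u)


paired = list(code_points([0xD83D, 0xDE00]))    # U+1F600 as a surrogate pair
unpaired = list(code_points([0xD83D, 0x0041]))  # lead surrogate followed by 'A'
```

      Here the same number 0xD83D yields one value when it is treated as an unpaired surrogate code point and another when it is treated as the lead code unit of a pair; only callers that care about the second case pay for the extra table.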

      Size overhead: A UTrie2 with no data at all takes about 4.6 kB. It compacts a large amount of data fairly well, but for collation, where a typical tailoring has only a small amount of data, we might be able to reduce the data size. With less overhead, we might also use more separate tries where we currently combine them or avoid them altogether.
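
      To make the notion of fixed overhead concrete, here is a naive two-stage trie sketch. It is deliberately simplistic and is not UTrie2's actual layout; the block size and all names are arbitrary choices for illustration. It shows why a trie with no data still has a non-trivial size: the index must cover the whole code point range before any value is stored.

```python
# Naive two-stage trie for illustration; this is NOT UTrie2's actual layout.
BLOCK_SHIFT = 6                          # 64 code points per data block (arbitrary)
BLOCK_SIZE = 1 << BLOCK_SHIFT
BLOCK_MASK = BLOCK_SIZE - 1
INDEX_LENGTH = 0x110000 >> BLOCK_SHIFT   # one index entry per block of code points


def build_trie(values, default=0):
    """Build (index, data) from {code_point: value}, sharing identical blocks."""
    blocks = {}
    for c, v in values.items():
        block = blocks.setdefault(c >> BLOCK_SHIFT, [default] * BLOCK_SIZE)
        block[c & BLOCK_MASK] = v
    data = [default] * BLOCK_SIZE        # shared all-default block at offset 0
    offsets = {tuple(data): 0}
    index = [0] * INDEX_LENGTH           # every entry starts at the default block
    for i, block in blocks.items():
        key = tuple(block)
        if key not in offsets:           # compaction: store each distinct block once
            offsets[key] = len(data)
            data.extend(block)
        index[i] = offsets[key]
    return index, data


def lookup(trie, c):
    """Return the value for code point c."""
    index, data = trie
    return data[index[c >> BLOCK_SHIFT] + (c & BLOCK_MASK)]


empty = build_trie({})
small = build_trie({0x41: 1, 0x1F600: 7})
# Fixed cost: the index has INDEX_LENGTH entries even when there is no data.
print(len(empty[0]), len(empty[1]))
print(len(small[0]), len(small[1]))
```

      In this sketch the empty trie already carries 17408 index entries plus one default data block, and adding two values only adds two more blocks. A small data set is therefore dominated by the fixed part, which is the kind of overhead a redesign would try to shrink.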

      I have some ideas: http://site.icu-project.org/design/struct/utrie

      I will experiment and benchmark in a branch.

              People

              • Assignee: markus.icu Markus Scherer
              • Reporter: markus.icu Markus Scherer
              • Reviewer: Andy Heninger
              • Votes: 0
              • Watchers: 2
