We're updating the issue view to help you get more done. 

ICU 63 UnicodeSet startup performance regression

Description

Chrome has reported a startup performance regression (https://crbug.com/899983) that appears to be caused by changes I made in UnicodeSet for "sets & maps for Unicode properties" for creating a set from an enumerated-property value like [:script=Cyrl:].

Set building until ICU 62:

  • enumerate a many-properties trie to find code points where any of the properties changes value

  • add those code points to an "inclusion set" and cache it

  • for each inclusion set code point do a property trie lookup, compare with the value (e.g., Cyrl), and add to the result set

  • building another set for a value of the same property repeats the previous step: look up thousands of property values and add to the new result set

First-time set building until ICU 63:

  • enumerate a many-properties trie to find code points where any of the properties changes value

  • add those code points to an "inclusion set" and cache it

  • for an enumerated (not binary) property, build a single-property CodePointTrie:

    • for each inclusion set code point do a property trie lookup and add a range to the trie

    • build the trie into an immutable/small/fast form and cache it

  • use the trie range iteration function to find ranges with the value (e.g., Cyrl), and add to the result set

  • building another set for a value of the same property repeats the previous step: iterate over desired-value ranges in the cached single-property trie and add to the new result set

The enumerated-property CodePointTrie is also exposed via new public API.

Iterating over trie ranges is simpler than over-sampling code points and looking up a value for each from scratch, and should be faster, which should benefit building a set for a property value when the trie for that property has been cached already.

I suspect that building the trie is uncomfortably slow.

I will try to make some measurements and do some profiling.
I should probably try to bypass building a trie for a property set if there is not yet a trie for that property.
Longer-term, I should look into making the trie builder faster; that might turn into a separate ticket.

Environment

Status

Assignee

Markus Scherer

Reporter

Markus Scherer

Labels

Time Needed

Hours

Components

Fix versions

Priority

major