ICU 63 UnicodeSet startup performance regression

Description

Chrome has reported a startup performance regression (https://crbug.com/899983) that appears to be caused by changes I made in UnicodeSet for "sets & maps for Unicode properties" for creating a set from an enumerated-property value like [:script=Cyrl:].

Set building until ICU 62:

  • enumerate a many-properties trie to find code points where any of the properties changes value

  • add those code points to an "inclusion set" and cache it

  • for each inclusion set code point do a property trie lookup, compare with the value (e.g., Cyrl), and add to the result set

  • building another set for a value of the same property repeats the previous step: look up thousands of property values and add to the new result set

First-time set building until ICU 63:

  • enumerate a many-properties trie to find code points where any of the properties changes value

  • add those code points to an "inclusion set" and cache it

  • for an enumerated (not binary) property, build a single-property CodePointTrie:

    • for each inclusion set code point do a property trie lookup and add a range to the trie

    • build the trie into an immutable/small/fast form and cache it

  • use the trie range iteration function to find ranges with the value (e.g., Cyrl), and add to the result set

  • building another set for a value of the same property repeats the previous step: iterate over desired-value ranges in the cached single-property trie and add to the new result set

The enumerated-property CodePointTrie is also exposed via new public API.

Iterating over trie ranges is simpler than over-sampling code points and looking up a value for each from scratch, and should be faster, which should benefit building a set for a property value when the trie for that property has been cached already.

I suspect that building the trie is uncomfortably slow.

I will try to make some measurements and do some profiling.
I should probably try to bypass building a trie for a property set if there is not yet a trie for that property.
Longer-term, I should look into making the trie builder faster; that might turn into a separate ticket.

Activity

Show:
Markus Scherer
November 1, 2018, 8:24 PM

I am building a set like this, in C++:

This builds two single-property tries: One for Script, and one for General_Category.

On my Linux machine with clang++ 4 and an optimized build with debug symbols, this takes about 19ms. callgrind shows that most of the time is spent building the single-property tries (umutablecptrie_buildImmutable()) but unfortunately it does not give details about which part of that takes how long.

With a pure debug build (no optimization), it takes about 70ms. 92% of that is in umutablecptrie_buildImmutable(). Below that, 10% of the total time is in compactIndex() and 78% in compactData(). Most of that is in findSameBlock(), and half of the total time in equalBlocks() called from there. In other words, comparing each mixed-value block with any and all data previously written.

When I build this set twice, the second time takes 3..4ms, so using the trie should be fine. Building the trie is the problem.

Markus Scherer
November 3, 2018, 6:03 AM

I decided to try to make building the trie faster.
Most of the time was spent in comparing each data and index block with earlier parts of the data/index at all positions.
I wrote a simple, custom hashtable to replace the linear searches.

On my machine with an optimized build, building the above set now takes ca. 7.7ms, 60% less time than before.
A fair bit of the reduced time is spent building and updating the hashtables (e.g., computing hash codes). The lookups are negligible. Block equality is checked much, much less often (mostly for hash collisions).

C++ code here: https://github.com/markusicu/icu/tree/triebuildhash
To do: Make a pull request, send to Andy, remove debug/experimental code, port to Java.

Markus Scherer
November 6, 2018, 4:23 PM

Android GoogleIssue:118748230

Markus Scherer
November 6, 2018, 4:38 PM
Markus Scherer
November 12, 2018, 12:14 AM

After PR 265 I did some more perf work. I made UnicodeSet.add(new last range) faster, reduced the number of memory allocations in UnicodeSet, and made CodePointTrie.getRange() a bit faster.

In the end, building a set via range starts + lookup was still faster than building it from a CodePointTrie, both during startup (getting the set of range starts vs. building a trie) as well as later (iterating over range starts + lookup vs. trie.getRange()), so I reverted to that. I also cache a starts set per int property, to reduce the number of lookups.

On my machine, building the test set for the first time now takes about 1ms (down from the original 19ms); in a debug build, 2ms (down from 70ms). Building the set a second time takes ca. 0.17ms (down from 3..4ms; debug now takes 0.4ms).

See PR 278.

Fixed

Assignee

Markus Scherer

Reporter

Markus Scherer

Components

Labels

Reviewer

None

Priority

major

Time Needed

Hours

Fix versions

Configure