Avoid UnicodeSet constructors taking String pattern when all code points are known

Description

Deleted Component: misc

UnicodeSet constructors taking a String pattern are much slower than the constructor taking code points as an int array. When all code points used for an instance of UnicodeSet are known in advance, we should use the int array version.

For example, StringTokenizer has a static final field, DEFAULT_DELIMITERS, defined as below:

Although this is a one-time initialization, this code alone takes 90% of StringTokenizer initialization time. With the code below:

the class initializer for StringTokenizer is about 15 times faster than the current one.

It looks like there are several other places in ICU code that can be changed to use the faster constructor.

Activity

TracBot
June 30, 2018, 11:39 PM
Trac Comment 2 by —2011-01-17T19:32:18.153Z

Updated the usage in StringTokenizer addressed in this ticket. There is another candidate, AlphabeticIndex.HANGUL, but the same class has other UnicodeSet constructors that use Unicode properties, so the expected performance improvement is really minor. (Also, this specific instance contains 14 independent code points, and with UnicodeSet(int...) you have to specify 28 ints, duplicating each code point, because the constructor takes pairs of code points as range boundaries, which is somewhat ugly.)
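One way to avoid writing the duplicated values by hand would be a small helper that expands a list of individual code points into the start/end pairs that UnicodeSet(int...) takes. The following is only a sketch: the class name CodePointPairs and the method toRangePairs are hypothetical and not part of ICU, and the sorting and merging of adjacent code points reflects an assumption that the constructor expects ascending, non-adjacent ranges. It uses only the JDK, so the UnicodeSet call itself appears only in a comment.

```java
import java.util.Arrays;

// Hypothetical helper, not part of ICU: builds the (start, end) pair array
// that a range-pair constructor such as UnicodeSet(int...) takes.
final class CodePointPairs {

    // Sorts the code points, drops duplicates, and merges consecutive values
    // into a single range, so that the result is a flat array of
    // start0, end0, start1, end1, ... in ascending order.
    public static int[] toRangePairs(int... codePoints) {
        int[] sorted = codePoints.clone();
        Arrays.sort(sorted);

        int[] pairs = new int[sorted.length * 2]; // worst case: no merging
        int n = 0;
        for (int cp : sorted) {
            if (n > 0 && cp <= pairs[n - 1] + 1) {
                // Duplicate of, or adjacent to, the current range end: extend it.
                pairs[n - 1] = cp;
            } else {
                // Start a new single-code-point range (start == end).
                pairs[n++] = cp;
                pairs[n++] = cp;
            }
        }
        return Arrays.copyOf(pairs, n);
    }

    public static void main(String[] args) {
        // Illustrative whitespace-like code points, not taken from ICU source.
        int[] pairs = toRangePairs(0x0020, 0x0009, 0x000A, 0x000D, 0x000C);
        System.out.println(Arrays.toString(pairs));
        // With ICU4J on the classpath, the array could then be passed along:
        //   new UnicodeSet(pairs)
    }
}
```

Because the helper runs once per static field, its cost is small next to pattern parsing, but whether the added indirection is worth it for a set like AlphabeticIndex.HANGUL remains a judgment call.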

TracBot
June 30, 2018, 11:39 PM
Trac Comment 7 by —2016-10-05T23:13:36.787Z

Milestone 4.7.1 deleted

Fixed

Assignee

Yoshito Umaoka

Reporter

Yoshito Umaoka

Components

None

Labels

None

Reviewer

None

Priority

minor

Time Needed

None

Fix versions