optimize titlecasing BreakIterator handling

Description

The titlecasing functions create new BreakIterator instances on the fly if an iterator is not passed in. We should try to make this faster. Ideas:

  • Does BreakIterator itself cache anything?

  • Does word or sentence iteration differ by locale, or can we always use root-locale iterators?

  • Cache single instances?

  • Check instances out of a pool and back in, or clone a cached instance?

  • Can we use some internal trick to avoid ever loading word break dictionaries, since those apply only to unicameral scripts?
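The "cache single instances" and "clone" ideas above could be combined roughly as follows. This is only a minimal sketch, not ICU code: it uses the JDK's java.text.BreakIterator as a stand-in for ICU's BreakIterator, the class name WordIteratorCache is hypothetical, and it assumes that cloning a prototype that is never given text or moved is cheaper than creating a fresh instance, which would have to be measured.

```java
import java.text.BreakIterator;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper for illustration; not part of ICU or the JDK.
final class WordIteratorCache {
    // One prototype per locale. Prototypes are never handed out directly,
    // so nobody calls setText() or next() on them after creation.
    private static final ConcurrentMap<Locale, BreakIterator> PROTOTYPES =
            new ConcurrentHashMap<>();

    private WordIteratorCache() {}

    // Returns a private copy that the caller may mutate freely.
    static BreakIterator get(Locale locale) {
        BreakIterator prototype =
                PROTOTYPES.computeIfAbsent(locale, BreakIterator::getWordInstance);
        return (BreakIterator) prototype.clone();
    }
}
```

A caller would use WordIteratorCache.get(locale), call setText() on the returned copy, and simply drop it afterwards. A check-out/check-in pool would avoid even the clone, at the cost of callers having to return instances reliably.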

Also, in 2016 I wrote a CasingWordIterator prototype that implements the word break rules as far as they are relevant to titlecasing, based on simple property lookups. Small code, no dependencies, quick to create. We could use that. Drawback: We would want to add fairly heavy testing to make sure it behaves like the real word break iterator, and we would have to make changes in both when the rules change.
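To illustrate what "based on simple property lookups" might look like, here is a deliberately simplified sketch. It is not the 2016 prototype and does not implement the real word break rules: the class name, the choice of properties, and the apostrophe handling are all assumptions made for illustration. It also includes a helper that derives word starts from the JDK's java.text.BreakIterator (a stand-in for ICU's), which is the kind of comparison the heavy testing would need.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; a rough approximation, not the actual word break rules.
final class SimpleWordStarts {
    private SimpleWordStarts() {}

    // Word starts found with property lookups only.
    static List<Integer> find(String s) {
        List<Integer> starts = new ArrayList<>();
        boolean inWord = false;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int next = i + Character.charCount(cp);
            if (Character.isLetterOrDigit(cp)) {
                if (!inWord) {
                    starts.add(i);
                }
                inWord = true;
            } else if (isMarkOrFormat(cp)) {
                // Combining marks and format characters do not change the state.
            } else if (inWord && isApostrophe(cp) && next < s.length()
                    && Character.isLetterOrDigit(s.codePointAt(next))) {
                // Apostrophe between letters ("don't"): stay inside the word.
            } else {
                inWord = false;
            }
            i = next;
        }
        return starts;
    }

    // Word starts according to a real word iterator: boundaries that are
    // immediately followed by a letter or digit.
    static List<Integer> viaBreakIterator(String s) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(s);
        List<Integer> starts = new ArrayList<>();
        for (int b = words.first(); b != BreakIterator.DONE && b < s.length();
                b = words.next()) {
            if (Character.isLetterOrDigit(s.codePointAt(b))) {
                starts.add(b);
            }
        }
        return starts;
    }

    private static boolean isMarkOrFormat(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.FORMAT;
    }

    private static boolean isApostrophe(int cp) {
        return cp == '\'' || cp == 0x2019;
    }
}
```

A test suite along these lines would run both methods over a large, script-diverse corpus and report every string where the two lists differ.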

We could refresh BreakIterator.getTitleInstance() with rules parallel to those for getWordInstance(), but again only as far as they are relevant to titlecasing (no dictionaries; only the starts of words matter, not the ends).

We could test that titlecasing with the implicit iterator (when passing in null) yields the same results as titlecasing with an explicit getWordInstance() iterator.

Investigate, experiment, maybe spawn sub-tickets.

Assignee: Andy Heninger

Reporter: Markus Scherer

Time Needed: Days

Priority: medium