
A couple regex optimizations with UTF-8 UText* inputs

Description

Regex performance with UTF-8 UText* inputs is worse than with UnicodeString inputs, but a couple of simple optimizations help a lot:

1. Extra allocations when creating RegexMatchers. Currently, getting a RegexMatcher for a UText* input looks like this:

  • RegexPattern::matcher(UErrorCode&)

  • RegexMatcher::RegexMatcher(const RegexPattern*)

  • RegexMatcher::init2(RegexStaticSets::gStaticSets->fEmptyText)

  • RegexMatcher::reset(fEmptyText)

  • // Allocates space for fInputText
    utext_clone(fInputText, fEmptyText)

  • RegexMatcher::reset(UText* input)

  • // Allocates extra space in pExtra to copy the UTF8Bufs
    utext_clone(fInputText, input)

If there were a RegexPattern::matcher(UText*, UErrorCode&) overload, the two allocations could be collapsed to one. I've attached a patch that does this. The effect is a 10% speedup for my application (though I recognize for best performance I should just re-use the RegexMatcher).
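To make the allocation count concrete, here is a minimal sketch that models the two paths with toy types. It is not ICU code: ToyText, ToyMatcher, cloneText and the two helper functions are hypothetical names, and the model simply assumes that every clone costs one allocation, as described in the call chain above.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for a UText: cloning it allocates a fresh copy.
struct ToyText {
    std::string bytes;
};

// Counts simulated clone allocations (assumption: one allocation per clone).
static int g_cloneCount = 0;

std::unique_ptr<ToyText> cloneText(const ToyText& src) {
    ++g_cloneCount;
    return std::unique_ptr<ToyText>(new ToyText(src));
}

class ToyMatcher {
public:
    // Models the existing path: the matcher is first bound to an empty text.
    ToyMatcher() : input_(cloneText(emptyText())) {}

    // Models the proposed overload: the real input is cloned exactly once.
    explicit ToyMatcher(const ToyText& input) : input_(cloneText(input)) {}

    // Models reset(input): the held text is replaced by a new clone.
    void reset(const ToyText& input) { input_ = cloneText(input); }

    const std::string& text() const { return input_->bytes; }

private:
    static const ToyText& emptyText() {
        static const ToyText empty{};
        return empty;
    }
    std::unique_ptr<ToyText> input_;
};

// Existing path: default construction followed by reset(input).
int clonesForTwoStep(const ToyText& in) {
    g_cloneCount = 0;
    ToyMatcher m;
    m.reset(in);
    return g_cloneCount;
}

// Proposed path: construct directly against the input.
int clonesForDirect(const ToyText& in) {
    g_cloneCount = 0;
    ToyMatcher m(in);
    return g_cloneCount;
}
```

In this model the two-step path performs two clones and the direct path performs one, which is the saving the proposed overload is after. Re-using a single matcher and calling reset() for each new input avoids the construction cost altogether.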

2. Chunk-based matching doesn't happen. The implementation uses UTEXT_FULL_TEXT_IN_CHUNK() to check if the whole input string is available in chunkContents. But for UTF-8 inputs, chunkContents isn't filled in until access() is called.

I've attached a patch that manually calls fInputText->pFuncs->access(). I'm not sure that's the best way to do it, but it increases performance for my application by another 16%. Both patches together combine for a 25% boost.
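The lazy-fill behaviour can be illustrated with a small model. This is a sketch only, not the UText implementation: ToyChunkedText, fullTextInChunk and canUseChunkPath are hypothetical names, the check is a simplified analogue of UTEXT_FULL_TEXT_IN_CHUNK(), and the model assumes the whole input fits in a single chunk once it is loaded.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical, simplified model of a text provider whose chunk is
// filled lazily, the way the report describes UTF-8 inputs behaving.
struct ToyChunkedText {
    explicit ToyChunkedText(std::string s) : utf8(std::move(s)) {}

    std::string utf8;              // backing storage
    int64_t chunkNativeStart = 0;  // native index where the chunk begins
    int64_t chunkNativeLimit = 0;  // stays 0 until access() loads the chunk

    int64_t nativeLength() const {
        return static_cast<int64_t>(utf8.size());
    }

    // Models access(): loads the chunk. Assumption: the whole text
    // fits in one chunk.
    void access() {
        chunkNativeStart = 0;
        chunkNativeLimit = nativeLength();
    }
};

// Simplified analogue of the full-text-in-chunk check: true only when
// the chunk currently covers the entire input.
bool fullTextInChunk(const ToyChunkedText& t) {
    return t.chunkNativeStart == 0 && t.chunkNativeLimit == t.nativeLength();
}

// Decides whether the fast chunk-based path can be taken. With
// primeFirst set, the chunk is loaded before the check, which models
// the manual access() call in the patch.
bool canUseChunkPath(ToyChunkedText& t, bool primeFirst) {
    if (primeFirst) {
        t.access();
    }
    return fullTextInChunk(t);
}
```

For any non-empty input, the check fails until the chunk has been loaded and succeeds afterwards, so in this model the chunk-based path is only reachable if something calls access() before the check.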

Compared to using UnicodeString, my application was 36% slower with UTF-8 inputs before these patches, and only 15% slower after.

Status:
Assignee: Andy Heninger
Reporter: TracBot
Labels: None
Reviewer: None
Time Needed: Hours
Start date: None
Components:
Priority: medium