Regex performance with UTF-8 UText* inputs is worse than with UnicodeString inputs, but a couple of simple optimizations help a lot:
1. Extra allocations when creating RegexMatchers. Currently, getting a RegexMatcher for a UText* input looks like this:
// Allocates space for fInputText
// Allocates extra space in pExtra to copy the UTF8Bufs
If there were a RegexPattern::matcher(UText*, UErrorCode&) overload, the two allocations could be collapsed to one. I've attached a patch that does this. The effect is a 10% speedup for my application (though I recognize for best performance I should just re-use the RegexMatcher).
2. Chunk-based matching doesn't happen. The implementation uses UTEXT_FULL_TEXT_IN_CHUNK() to check if the whole input string is available in chunkContents. But for UTF-8 inputs, chunkContents isn't filled in until access() is called.
I've attached a patch that manually calls fInputText->pFuncs->access(). I'm not sure that's the best way to do it, but it increases performance for my application by another 16%. Together, the two patches give a 25% boost.
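The effect in point 2 can be shown with a toy model. This is a simplified stand-in, not ICU code: LazyText, its fields, and fullTextInChunk are hypothetical names that only mimic the idea of a provider whose chunk fields stay empty until access() is called. It ignores UTF-8 to UTF-16 conversion and any chunk size limit in the real provider.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Toy stand-in for a text provider with lazily filled chunk fields.
struct LazyText {
    std::string storage;                 // the full input
    const char *chunkContents = nullptr; // empty until access() runs
    int64_t chunkNativeStart = 0;
    int64_t chunkNativeLimit = 0;

    explicit LazyText(std::string s) : storage(std::move(s)) {}

    int64_t nativeLength() const {
        return static_cast<int64_t>(storage.size());
    }

    // Stand-in for the provider's access(): this toy version always
    // exposes the entire input as a single chunk.
    void access(int64_t /*index*/) {
        chunkContents = storage.data();
        chunkNativeStart = 0;
        chunkNativeLimit = nativeLength();
    }
};

// Modeled on the idea of the full-text-in-chunk check: the chunk must
// start at 0 and extend to the end of the input.
bool fullTextInChunk(const LazyText &t) {
    return t.chunkNativeStart == 0 && t.chunkNativeLimit == t.nativeLength();
}
```

In this model, fullTextInChunk() is false for a freshly constructed non-empty LazyText and only becomes true after access(0), which is why a check made before the first access() would always fall back to the slower path.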
Compared to using UnicodeString, my application was 36% slower with UTF-8 inputs before these patches, and only 15% slower after.