Nico Williams discussed the ability to test strings for equality in a way that is insensitive to normalization form. (This was raised as an example in a (long) thread on firstname.lastname@example.org about i18n.) That is, the equivalent of the following, but optimized:
form_insensitive_strcmp(a, b) == memcmp(normalize(a), normalize(b))
I think that would be a useful addition to the Normalizer2 implementation. The simplest form would be something like:
Normalizer2.equal(CharSequence a, CharSequence b)
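As a reference point for the intended semantics, here is a minimal, unoptimized sketch. It uses the JDK's java.text.Normalizer rather than ICU's Normalizer2 so that it runs without extra dependencies. The class name NaiveFormInsensitive and the choice of NFC are assumptions made for illustration, not part of the proposal.

```java
import java.text.Normalizer;

// Hypothetical helper: the unoptimized baseline that any faster version must agree with.
final class NaiveFormInsensitive {
    // True if a and b are canonically equivalent, i.e. equal after normalization.
    // NFC is an arbitrary choice here; NFD gives the same yes/no answer.
    static boolean equal(CharSequence a, CharSequence b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }
}
```

With this baseline, a precomposed "é" (U+00E9) and "e" followed by U+0301 compare as equal, while "é" and plain "e" do not. The cost is that both inputs are always normalized in full, which is what the optimized version is meant to avoid.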
Internally, the optimized version could compare code points until they differed, then drop into normalization code if the difference could possibly be resolved by normalization (i.e., the differing code points were not Inert). (A further optimization could be to return to code-point comparison afterwards.)
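A rough sketch of that idea follows, again using only the JDK so that it can be run as-is. The JDK exposes no inertness or boundary test, so this sketch swaps in a cruder, conservative rule: after finding the first difference, it backs up to the nearest position where both strings hold an ASCII character (ASCII characters have combining class 0 and never combine with a preceding character, so a normalization boundary lies just before them) and normalizes only the remainder. The class name PrefixSkippingEquals is hypothetical. An ICU-based version would presumably use the per-code-point tests on Normalizer2 instead, and could also return false immediately when the differing code points are inert.

```java
import java.text.Normalizer;

// Hypothetical sketch: skip the identical prefix, normalize only the tails.
final class PrefixSkippingEquals {
    static boolean equal(CharSequence a, CharSequence b) {
        // 1. Fast path: walk forward while the UTF-16 code units are identical.
        int limit = Math.min(a.length(), b.length());
        int d = 0;
        while (d < limit && a.charAt(d) == b.charAt(d)) {
            d++;
        }
        if (d == a.length() && d == b.length()) {
            return true; // identical strings, no normalization needed
        }

        // 2. Back up to a safe cut point: index 0, or an index where both
        //    strings hold an ASCII character. This is a conservative stand-in
        //    for a real boundary test and may back up further than necessary.
        int p = d;
        while (p > 0 && !(isAscii(a, p) && isAscii(b, p))) {
            p--;
        }

        // 3. Slow path: normalize and compare only the tails after the cut.
        String na = Normalizer.normalize(a.subSequence(p, a.length()), Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b.subSequence(p, b.length()), Normalizer.Form.NFC);
        return na.equals(nb);
    }

    private static boolean isAscii(CharSequence s, int i) {
        return i < s.length() && s.charAt(i) < 0x80;
    }
}
```

For "café" written with U+00E9 versus "cafe" followed by U+0301, the first difference is at index 3, the cut point backs up to the "f" at index 2, and only the two short tails are normalized. This sketch does not implement the further optimization of returning to code-point comparison after the normalized stretch.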
It would also be useful to support a generalization, so that one could check for matches within sequences. That would allow people to do the equivalent of startsWith(), for example, or compare segments within longer buffers.
CharSequence a, int aStart, int aEnd,
CharSequence b, int bStart, int bEnd,
Returns true iff
// the "toString()" is just to verify that it is only the char contents of a and b that matter in the comparison//
If the result is false and neither aMax nor bMax is null, sets them to the maximum offsets (≤ aEnd and ≤ bEnd, respectively) such that:
regionMatches(a, aStart, aMax.value, b, bStart, bMax.value) == true
The reason for having aEnd/bEnd is to avoid searching beyond the point we care about within a longer buffer. The reason for having aMax and bMax is to let the caller determine how far the match extends.
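To make the contract concrete, here is a deliberately brute-force sketch of one possible reading of it. Everything not named in the text above is an assumption: the class name RegionMatchSketch, the IntHolder type (the description only implies something with a "value" field), the parameter order, the use of NFC via the JDK's java.text.Normalizer, and the tie-breaking rule of preferring the largest aMax and then the largest bMax. It illustrates the semantics only; it is far too slow for real use.

```java
import java.text.Normalizer;

// Hypothetical sketch of the region-matching contract, not an optimized implementation.
final class RegionMatchSketch {
    // Assumed mutable holder for the out-parameters.
    static final class IntHolder {
        int value;
    }

    private static String nfc(CharSequence s, int start, int end) {
        return Normalizer.normalize(s.subSequence(start, end), Normalizer.Form.NFC);
    }

    static boolean regionMatches(CharSequence a, int aStart, int aEnd,
                                 CharSequence b, int bStart, int bEnd,
                                 IntHolder aMax, IntHolder bMax) {
        if (nfc(a, aStart, aEnd).equals(nfc(b, bStart, bEnd))) {
            return true;
        }
        if (aMax != null && bMax != null) {
            // Try every pair of end offsets. Because i and j only increase,
            // the last match recorded has the largest i, then the largest j.
            // The empty regions (aStart, bStart) always match, so both
            // holders are always set. Offsets are UTF-16 indices; a real
            // version would avoid splitting surrogate pairs.
            for (int i = aStart; i <= aEnd; i++) {
                for (int j = bStart; j <= bEnd; j++) {
                    if (nfc(a, aStart, i).equals(nfc(b, bStart, j))) {
                        aMax.value = i;
                        bMax.value = j;
                    }
                }
            }
        }
        return false;
    }
}
```

Under this reading, comparing "étude" spelled with "e" plus U+0301 against "étage" spelled with U+00E9 returns false and reports aMax = 3 and bMax = 2, since the decomposed "ét" occupies three code units and the precomposed one occupies two. A startsWith()-style check could then treat either a true result or bMax.value == bEnd as "a starts with b".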
I don't know whether it would be worthwhile to have a version that would support endsWith, which would require a "backwards" comparison, so I'd suggest not worrying about that for now.