
Add Normalizer2.regionMatches

Description

Nico Williams raised the idea of a string-equality test that is insensitive to normalization form. (It came up as an example in a long discussion thread about i18n on ietf@ietf.org.) That is, the following, but optimized:

form_insensitive_strcmp(a, b) == memcmp(normalize(a), normalize(b))

I think that would be a useful addition to the Normalizer2 implementation. The simplest form would be something like

Normalizer2.equal(CharSequence a, CharSequence b)
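As a reference point for the semantics only, here is a minimal unoptimized sketch. It assumes the JDK's java.text.Normalizer as a stand-in for Normalizer2 and fixes the form to NFC; the class name FormInsensitive is hypothetical and not part of ICU.

```java
import java.text.Normalizer;

// Hypothetical helper, not an ICU API: shows the intended result, not the optimization.
final class FormInsensitive {
    private FormInsensitive() {}

    // True iff a and b are identical after normalization (NFC assumed here).
    static boolean equal(CharSequence a, CharSequence b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }
}
```

For example, FormInsensitive.equal("\u00e9", "e\u0301") compares a precomposed é with e followed by a combining acute accent.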

Internally, the optimized version could compare code points until they differ, then drop into normalization code only if the difference could possibly be resolved by normalization (i.e., the differing code points are not Inert). (A further optimization could be to return to code-point comparison afterwards.)
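A rough sketch of just the fast path, under the same assumptions as before (java.text.Normalizer standing in for Normalizer2, NFC, hypothetical class name). The JDK exposes no inertness test, so this version skips the "not Inert" check and falls back to full normalization on the first difference; an ICU implementation could presumably consult Normalizer2.isInert(int) at that point and return false early.

```java
import java.text.Normalizer;

// Hypothetical helper, not an ICU API.
final class FastFormInsensitive {
    private FastFormInsensitive() {}

    static boolean equal(CharSequence a, CharSequence b) {
        // Fast path: scan UTF-16 code units while they are identical.
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        if (i == a.length() && i == b.length()) {
            return true; // identical input, so identical after normalization
        }
        // Slow path: a difference was found. Without an inertness check we
        // cannot rule out that normalization resolves it, so normalize both.
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }
}
```

Normalizing only from the last normalization boundary before the difference, rather than the whole string, would be the next refinement; it is left out here because it needs boundary data the JDK does not expose.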

It would also be useful to support a generalization, so that one could check for matches within sequences. That would allow people to do the equivalent of startsWith(), for example, or compare segments within longer buffers.

Normalizer2.regionMatches(
CharSequence a, int aStart, int aEnd,
CharSequence b, int bStart, int bEnd,
Output<Integer> aMax,
Output<Integer> bMax)

Returns true iff

normalize(a.subSequence(aStart, aEnd).toString())
.equals(
normalize(b.subSequence(bStart, bEnd).toString()))

(The toString() calls are there only to make clear that nothing but the char contents of a and b matters in the comparison.)
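The boolean result defined above can be written out directly as an unoptimized reference. This sketch again assumes java.text.Normalizer and NFC in place of Normalizer2, uses a hypothetical class name, and leaves out the aMax/bMax outputs entirely.

```java
import java.text.Normalizer;

// Hypothetical helper, not an ICU API: reference semantics for the return value only.
final class RegionMatchSketch {
    private RegionMatchSketch() {}

    // True iff the two regions are identical after normalization (NFC assumed).
    static boolean regionMatches(CharSequence a, int aStart, int aEnd,
                                 CharSequence b, int bStart, int bEnd) {
        String na = Normalizer.normalize(a.subSequence(aStart, aEnd), Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b.subSequence(bStart, bEnd), Normalizer.Form.NFC);
        return na.equals(nb);
    }
}
```

With a = "caf\u00e9 au lait" and b = "xxcafe\u0301", the call regionMatches(a, 0, 4, b, 2, 7) compares the precomposed "café" at the start of a against the decomposed "café" at the end of b.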

If the result is false and neither aMax nor bMax is null, sets them to the maximum offsets (≤ aEnd, bEnd, respectively) such that:

regionMatches(a, aStart, aMax.value, b, bStart, bMax.value) == true

The aEnd/bEnd parameters are there so that the comparison does not search beyond the point we care about within a longer buffer. The aMax and bMax outputs let the caller determine how far the match extends.


I don't know whether it would be worthwhile to have a version that would support endsWith, which would require a "backwards" comparison, so I'd suggest not worrying about that for now.

Status

Assignee: Markus Scherer
Reporter: Mark Davis
Labels: None
Reviewer: None
Time Needed: None
Start date: None
Components:
Fix versions:
Priority: assess