string search does not return correct position when search pattern is U+00C2 U+0303

Description

I found a bug in the string search code. We are using ICU 3.2.1, but I confirmed the bug also exists in ICU 3.8. Here is a self-contained program that reproduces the problem. I expect match to be 1, but I get a value of 4. The bug only occurs when I use a Strength 1 collator. If I use a Strength 2 or greater collator, I get the correct value of 1.

Activity

TracBot
July 1, 2018, 12:13 AM
Trac Comment 4 by —2007-09-21T05:03:21.000Z

Some more information:

The break iterator and locale code above can be removed. The collator is equal to 'en_US' at L1.

At primary strength, the given search string is equivalent to \u00C2; the \u0303 is ignored.

Oddly enough, with source "\u00C2" and search string "\u00C2" there is a match at every strength except L2. hasAccentsAfterMatch() is triggered, and as an uneducated guess I suspect something is wrong in there. I note a getFCD call; perhaps \u00c2\u0303 is a denormalized form.

It seems like there should be a test to make sure that "\u00c2" finds "\u00c2", etc for all weights. Or, perhaps there is some interaction with the normalization mode.

TracBot
July 1, 2018, 12:13 AM
Trac Comment 4.5 by —2007-09-21T05:25:54.000Z

This might be working as expected. I think normalization is disabled by default in en_US for performance reasons; only some locales turn it on.

\u00c2\u0303 is denormalized. The NFC form is \u1eaa, and NFD is \u0041\u0302\u0303.
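The normalization forms quoted above can be checked independently of ICU. The following is a minimal sketch using Python's standard `unicodedata` module; the helper `code_points` and the variable names are made up for illustration, and nothing here exercises ICU's own normalizer or its FCD check.

```python
import unicodedata

def code_points(text):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# The sequence from the report: A WITH CIRCUMFLEX followed by COMBINING TILDE.
sequence = "\u00c2\u0303"

nfc = unicodedata.normalize("NFC", sequence)
nfd = unicodedata.normalize("NFD", sequence)

print("input:", code_points(sequence))
print("NFC:  ", code_points(nfc))
print("NFD:  ", code_points(nfd))

# The input differs from both forms, so it is neither NFC nor NFD.
print("input is NFC:", sequence == nfc)
print("input is NFD:", sequence == nfd)
```

All three strings are canonically equivalent, so a search that honors canonical equivalence would be expected to treat them alike.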

So you might want to try it with normalization on, or try the NFC or NFD forms to see if things change.

TracBot
July 1, 2018, 12:13 AM
Trac Comment 6 by —2007-09-22T23:27:18.000Z

Turning normalization on for the collator did not fix the problem.

#5954 has a simplified test case for what is probably a related issue.

TracBot
July 1, 2018, 12:13 AM
Trac Comment 7 by dmso@5b2f15167f229dfb—2007-09-25T18:35:10.000Z

More information I found from experimenting.

When the source string contains the unnormalized sequence (U+00C2 U+0303), ICU does not match it, whether the search string is the unnormalized sequence itself (U+00C2 U+0303), its NFC/NFKC form (U+1EAA), or its NFD/NFKD form (U+0041 U+0302 U+0303). This only affects a strength 1 collator (tried LDE and LEN). A strength 1 collator should be case and accent insensitive, so it should match the unnormalized sequence, but ICU does not. The result is the same whether normalization checking is on (NO) or off (NX) when the collator is opened. Note that ICU does find a match when the source string contains the NFC/NFKC form (U+1EAA) or the NFD/NFKD form (U+0041 U+0302 U+0303), which suggests ICU does some normalization on the search string but not on the source string.
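To make the expected behavior concrete, here is a rough reference oracle for a case- and accent-insensitive search. It is a sketch under stated assumptions, not ICU's algorithm: it approximates primary strength by decomposing to NFD, dropping characters with a non-zero combining class, and case folding. The names `primary_key` and `find_primary` and the sample source strings are hypothetical; the original reproduction program is not part of this ticket.

```python
import unicodedata

def primary_key(text):
    """Approximate a strength-1 key: NFD, drop combining marks, fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

def find_primary(source, pattern):
    """Return the first index in source whose span matches pattern, or -1."""
    target = primary_key(pattern)
    if not target:
        return -1
    for start in range(len(source)):
        # A match may not begin in the middle of a combining sequence.
        if unicodedata.combining(source[start]):
            continue
        for end in range(start + 1, len(source) + 1):
            # A match may not end in the middle of a combining sequence.
            if end < len(source) and unicodedata.combining(source[end]):
                continue
            key = primary_key(source[start:end])
            if key == target:
                return start
            if not target.startswith(key):
                break
    return -1

# Hypothetical sources: the same text in unnormalized, NFC, and NFD form.
sources = ["x\u00c2\u0303y", "x\u1eaay", "xA\u0302\u0303y"]
patterns = ["\u00c2\u0303", "\u1eaa", "A\u0302\u0303"]

for source in sources:
    for pattern in patterns:
        print(repr(source), repr(pattern), find_primary(source, pattern))
```

Under this approximation every combination of source and pattern form yields the same match position, which is the behavior the comment above argues a strength 1 collator should have.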

TracBot
July 1, 2018, 12:13 AM
Trac Comment 10 by —2007-10-08T22:12:30.000Z

Part of the cause is the prefix and suffix accent detection during the pattern initialization phase of the string search. An accent is detected on 0x00C2 (even though accents are ignored by a strength 1 collator), so the search incorrectly skips the first occurrence of 0x0041 when shiftForward is called for the second time during the actual search phase. This explains why the issue only comes up with a strength 1 collator.

Status: Fixed
Assignee: mow@icu-project.org
Reporter: TracBot
Labels: None
Reviewer: None
Priority: major
Time Needed: Days