Hang in string search function usearch_first() when the search string is a single combining diacritical

Description

usearch_first() never returns when the search string is a single combining diacritical. Such a search may not be meaningful, but usearch_first() should still not hang. I have reproduced the issue on both ICU 3.2.1 and ICU 3.8. The following small program demonstrates it.

#include <stdio.h>
#include "unicode/ucol.h"
#include "unicode/ubrk.h"
#include "unicode/usearch.h"

int main(void)
{
    /* Pattern: a single combining diacritical, U+0300 (COMBINING GRAVE ACCENT). */
    UChar search[] = { 0x0300 };
    UChar source[] = { 0x0020,
        0xDD3D, 0x0020, 0xDD3D, 0x0055, 0xDD3D, 0x0075, 0xDD3D, 0x00D9, 0xDD3D, 0x0055, 0x0300, 0xDD3D, 0x0055, 0x0340, 0xDD3D, 0x00F9, 0xDD3D, 0x0075, 0x0300, 0xDD3D, 0x0075, 0x0340, 0xDD3D, 0xD978, 0xDCF9, 0xDD3D, 0x0306,
        0x0020 };

    int32_t searchLen;
    int32_t sourceLen;
    UErrorCode icuStatus = U_ZERO_ERROR;
    /* Handles start out NULL so the cleanup at exit can tell what was opened. */
    UCollator *coll = NULL;
    const char *locale;
    UBreakIterator *ubrk = NULL;
    UStringSearch *usearch = NULL;
    int32_t match = 0;

    searchLen = sizeof(search)/sizeof(UChar);
    sourceLen = sizeof(source)/sizeof(UChar);

    /* S2 in the short string selects secondary strength. */
    coll = ucol_openFromShortString( "LDE_AN_CX_EX_FX_HX_NX_S2",
                                     0,      /* forwardAccess: false */
                                     NULL,
                                     &icuStatus );
    if ( U_FAILURE(icuStatus) )
    {
        printf( "ucol_openFromShortString error\n" );
        goto exit;
    }

    locale = ucol_getLocaleByType( coll,
                                   ULOC_VALID_LOCALE,
                                   &icuStatus );
    if ( U_FAILURE(icuStatus) )
    {
        printf( "ucol_getLocaleByType error\n" );
        goto exit;
    }

    /* Opened as in the original report; the search below is given NULL
       for its break iterator, so this iterator is not used by it. */
    ubrk = ubrk_open( UBRK_CHARACTER,
                      locale,
                      source,
                      sourceLen,
                      &icuStatus );
    if ( U_FAILURE(icuStatus) )
    {
        printf( "ubrk_open error\n" );
        goto exit;
    }

    usearch = usearch_openFromCollator( search,
                                        searchLen,
                                        source,
                                        sourceLen,
                                        coll,
                                        NULL,
                                        &icuStatus );
    if ( U_FAILURE(icuStatus) )
    {
        printf( "usearch_openFromCollator error\n" );
        goto exit;
    }

    usearch_setAttribute( usearch,
                          USEARCH_OVERLAP,
                          USEARCH_ON,
                          &icuStatus );
    if ( U_FAILURE(icuStatus) )
    {
        printf( "usearch_setAttribute error\n" );
        goto exit;
    }

    /* This call never returns on the affected versions. */
    match = usearch_first( usearch,
                           &icuStatus );
    if ( U_FAILURE(icuStatus) )
    {
        printf( "usearch_first error\n" );
        goto exit;
    }

    printf( "match=%d\n", match );

exit:
    /* Release whatever was successfully opened. */
    if ( usearch != NULL ) usearch_close( usearch );
    if ( ubrk != NULL ) ubrk_close( ubrk );
    if ( coll != NULL ) ucol_close( coll );
    return 0;
}

Activity

TracBot
July 1, 2018, 12:16 AM
Trac Comment 1 by dmso@5b2f15167f229dfb—2007-10-16T19:12:15.000Z

More info. The repro program can be simplified by shortening the source string so that it contains only the character U+00D9 (LATIN CAPITAL LETTER U WITH GRAVE). The search string is U+0300 (COMBINING GRAVE ACCENT). The hang only occurs with Strength 2 collations; it does not seem to occur with Strength 1, 3, or 4 collations.

TracBot
July 1, 2018, 12:16 AM
Trac Comment 4 by —2007-11-06T20:05:26.000Z

The hang is the result of an infinite loop: the collation element iterator returns UCOL_NULLORDER, which the loop condition was not checking for. The loop is in the internal function checkExtraMatchAccents.
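
For illustration only, here is a minimal sketch of that failure pattern. It is not the ICU source. SketchCEIter, sketch_next, sketch_find and SKETCH_NULLORDER are hypothetical stand-ins, with SKETCH_NULLORDER playing the role of UCOL_NULLORDER. The loop advances an iterator until it sees a target element. Without the end-of-iteration check, a search for an element that never appears would spin forever, because the exhausted iterator keeps returning the sentinel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel standing in for UCOL_NULLORDER. */
#define SKETCH_NULLORDER ((int32_t)-1)

/* Hypothetical iterator over a fixed array of collation elements. */
typedef struct {
    const int32_t *ces;   /* elements to iterate over */
    size_t length;        /* number of elements in ces */
    size_t pos;           /* index of the next element to return */
} SketchCEIter;

/* Returns the next element, or SKETCH_NULLORDER once the iterator is
   exhausted (and on every call after that). */
int32_t sketch_next(SketchCEIter *it)
{
    assert(it != NULL);
    if (it->pos >= it->length) {
        return SKETCH_NULLORDER;
    }
    return it->ces[it->pos++];
}

/* Returns the index of the first element equal to target (counted from
   the iterator's current position), or -1 if the iterator runs out.
   target is assumed to differ from SKETCH_NULLORDER. */
int32_t sketch_find(SketchCEIter *it, int32_t target)
{
    int32_t index = 0;
    for (;;) {
        int32_t ce = sketch_next(it);
        if (ce == SKETCH_NULLORDER) {
            /* The guard: without this check the loop never terminates
               when target is absent. */
            return -1;
        }
        if (ce == target) {
            return index;
        }
        index++;
    }
}
```

Calling sketch_find on an iterator over { 5, 7, 9 } with target 9 returns 2. With target 4 it returns -1 instead of looping, which is the behavior the guard is there to provide.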

Fixed

Assignee: mow@icu-project.org
Reporter: TracBot
Components:
Labels: None
Reviewer: None
Priority: major
Time Needed: Days
Fix versions: