ko collation bug: [reorder Hang Hani] + numeric gives 40>72

Description

See 2013-feb-06 icu-support thread "wrong result in ucol_strcollIter() when collation reorder rules is [Hang Hani|reorder]." (http://sourceforge.net/mailarchive/message.php?msg_id=30453966)

  • Android ICU 4.8.1.1

  • ko_KR modified to include `[Hang Hani|reorder]`

  • strength=primary

  • numeric=on

  • ucol_strcollIter() for UTF-8 input

  • Wrong result: "40" > "72"

But, it works right when `[Hang|reorder]` or when numeric=off.

Activity

Show:

UnicodeBot 
July 1, 2018 at 12:09 AM

Trac Comment 5.7 by muzbit@f74d39fa044aa309—2013-09-15T22:35:34.612Z

Replying to (Comment 5 markus):

This was a fairly significant bug that had nothing to do with Korean, it just happened to manifest with the given script reordering and input strings. The bug was that in a couple of code paths isContinuation(CE) was tested after masking the continuation bits (among others) off of the CE value.

 

It showed up in this scenario because in the current code we get continuation CEs from numeric collation, and the difference was in there. They got incorrectly script-reordered (the permutation must be done only on the non-continuation lead byte).

 

The bug should show up just as well in certain differences among vanilla collation elements with 3-byte or 4-byte primaries. 3-byters are very common in recent versions of the root collation data. I am surprised that we have not seen more bug reports for script reordering.

 

This kind of bug cannot happen in my "collv2" code because there are no continuations there. That complication was one of the early goals and reasons to start a rewrite.

Thank you for your support^^

UnicodeBot 
July 1, 2018 at 12:09 AM

Trac Comment 5 by —2013-09-13T03:09:04.706Z

This was a fairly significant bug that had nothing to do with Korean, it just happened to manifest with the given script reordering and input strings. The bug was that in a couple of code paths isContinuation(CE) was tested after masking the continuation bits (among others) off of the CE value.

It showed up in this scenario because in the current code we get continuation CEs from numeric collation, and the difference was in there. They got incorrectly script-reordered (the permutation must be done only on the non-continuation lead byte).

The bug should show up just as well in certain differences among vanilla collation elements with 3-byte or 4-byte primaries. 3-byters are very common in recent versions of the root collation data. I am surprised that we have not seen more bug reports for script reordering.

This kind of bug cannot happen in my "collv2" code because there are no continuations there. That complication was one of the early goals and reasons to start a rewrite.

UnicodeBot 
July 1, 2018 at 12:09 AM

Trac Comment 4 by enh@1d5920f4b44b27a8—2013-08-26T17:01:57.391Z

see https://android-review.googlesource.com/64222

UnicodeBot 
July 1, 2018 at 12:09 AM

Trac Comment 3 by muzbit@f74d39fa044aa309—2013-07-11T05:43:29.627Z

I tested this issue using ucol_strcollUTF8() in Android AOSP Sources.
But the result was incorrect.

[preconditions'''|'''Test]

  • Base Environment : Android AOSP Image (Emulator)

  • ICU Version : AOSP's ICU Version is 51

  • Locale Resource File : ko_kr.txt [Hang '''Hani'''|reorder] (it is Android Native)

  • Modified Android Source Code
    external\sqlite\android\sqlite3_android.cpp
    1. In collate8() Function

    static int collate8(void *p, int n1, const void *v1, int n2, const void *v2) { UCollator *coll = (UCollator *) p; // UCharIterator i1, i2; //Added Code UErrorCode status = U_ZERO_ERROR; // uiter_setUTF8(&i1, (const char *) v1, n1); // uiter_setUTF8(&i2, (const char *) v2, n2); //UCollationResult result = ucol_strcollIter(coll, &i1, &i2, &status); //Added Code UCollationResult result = ucol_strcollUTF8(coll,(const char *)v1,n1,(const char *)v2,n2,&status); if (U_FAILURE(status)) { // ALOGE("Collation iterator error: %d\n", status); } ...

    2. In register_localized_collators() Function

    //Added Code_S: set ascending sort for digits in SQL #define LOCALIZED_NUM_COLLATOR_NAME "LOCALIZED_NUM" //Added Code_E extern "C" int register_localized_collators(sqlite3* handle, const char* systemLocale, int utf16Storage) { ... } else { err = sqlite3_create_collation_v2(handle, PHONEBOOK_COLLATOR_NAME, SQLITE_UTF8, collator, collate8, (void(*)(void*))localized_collator_dtor); } if (err != SQLITE_OK) { return err; } //// PHONEBOOK_COLLATOR //Added Code_S : set ascending sort for digits in SQL status = U_ZERO_ERROR; collator = ucol_open(systemLocale, &status); if (U_FAILURE(status)) { ALOGE("Locale set error"); return -1; } ucol_setAttribute(collator, UCOL_STRENGTH, UCOL_PRIMARY, &status); if (U_FAILURE(status)) { return -1; } ucol_setAttribute(collator, UCOL_NUMERIC_COLLATION, UCOL_ON, &status); if (U_FAILURE(status)) { return -1; } status = U_ZERO_ERROR; //ucol_getShortDefinitionString(collator, NULL, buf, 1024, &status); if (utf16Storage) { err = sqlite3_create_collation_v2(handle, LOCALIZED_NUM_COLLATOR_NAME, SQLITE_UTF16, collator, collate16, (void(*)(void*))localized_collator_dtor); } else { err = sqlite3_create_collation_v2(handle, LOCALIZED_NUM_COLLATOR_NAME, SQLITE_UTF8, collator, collate8, (void(*)(void*))localized_collator_dtor); } if (err != SQLITE_OK) { return err; } //Added Code_E

    framework/base/core/java/android/database/sqlite/SQLiteConnection.java
    3. In setLocaleFromConfiguration() Function

    private void setLocaleFromConfiguration() { ... try { execute("DELETE FROM android_metadata", null, null); execute("INSERT INTO android_metadata (locale) VALUES(?)", new Object[] { newLocale }, null); execute("REINDEX LOCALIZED", null, null); //Added Code_S:set ascending sort for digits in SQL execute("REINDEX LOCALIZED_NUM", null, null); //Added Code_E success = true; } finally { execute(success ? "COMMIT" : "ROLLBACK", null, null); }

    [Scenario'''|'''Test]
    1. Build modified source file & change output image & boot emulator
    2. In Emulator Change Locale : Settings--> Language&input --> Language --> 한국어
    3. Run Test App
    a. create test database
    b. insert 0~99 number

    ContentValues values = new ContentValues(); for(int i =0 ; i< 100; i++) { values.put("data", i); db.insert(TABLE_NAME, null, values); }

    c. Query numeric sort order and native localized sort order

    //"data COLLATE LOCALIZED_NUM ASC for numeric sort" Cursor cur = getContentResolver().query(Uri.parse("content://com.muzbit.collate_test"),new String[] {"data"}, null, null, "data COLLATE LOCALIZED_NUM ASC" ); //"data COLLATE LOCALIZED ASC for native localized sort" Cursor cur2 = getContentResolver().query(Uri.parse("content://com.muzbit.collate_test"),new String[] {"data"}, null, null, "data COLLATE LOCALIZED ASC" );

[Result'''|'''Tese]
07-10 21:16:40.976: D/muzbit(1047): LOCALIZED_NUM = 0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/'''16/59/60/61/62/63/64/65/66/67/68/69/70/71/72/73/74/75/76/77/78/79/80/81/82/83/84/85/86/87/88/89/90/91/92/93/94/95/96/97/98/99/17/'''18/19/20/21/22/23/24/25/26/27/28/29/30/31/32/33/34/35/36/37/38/39/40/41/42/43/44/45/46/47/48/49/50/51/52/53/54/55/56/57/58/
07-10 21:16:41.146: D/muzbit(1047): LOCALIZED = 0/1/10/11/12/13/14/15/16/17/18/19/2/20/21/22/23/24/25/26/27/28/29/3/30/31/32/33/34/35/36/37/38/39/4/40/41/42/43/44/45/46/47/48/49/5/50/51/52/53/54/55/56/57/58/59/6/60/61/62/63/64/65/66/67/68/69/7/70/71/72/73/74/75/76/77/78/79/8/80/81/82/83/84/85/86/87/88/89/9/90/91/92/93/94/95/96/97/98/99/

Please review this result.

Thank you.

UnicodeBot 
July 1, 2018 at 12:09 AM

Trac Comment 1.2 by muzbit@f74d39fa044aa309—2013-06-17T01:26:26.809Z

Replying to (Comment 1 markus):

This seems to work fine in the Locale Explorer collation demo, but that does not use ucol_strcollIter(). Need to write test code and step in with a debugger.

May i know progress of this issue?

Fixed

Details

Assignee

Reporter

Components

Priority

Time Needed

Hours

Fix versions

Created June 28, 2018 at 5:20 PM
Updated October 3, 2018 at 10:53 PM
Resolved July 1, 2018 at 8:43 PM

Flag notifications