ko collation bug: [reorder Hang Hani] + numeric gives 40>72
Description
Activity
UnicodeBot July 1, 2018 at 12:09 AM
Trac Comment 5.7 by muzbit@f74d39fa044aa309—2013-09-15T22:35:34.612Z
Replying to (Comment 5 markus):
This was a fairly significant bug that had nothing to do with Korean, it just happened to manifest with the given script reordering and input strings. The bug was that in a couple of code paths isContinuation(CE) was tested after masking the continuation bits (among others) off of the CE value.
It showed up in this scenario because in the current code we get continuation CEs from numeric collation, and the difference was in there. They got incorrectly script-reordered (the permutation must be done only on the non-continuation lead byte).
The bug should show up just as well in certain differences among vanilla collation elements with 3-byte or 4-byte primaries. 3-byters are very common in recent versions of the root collation data. I am surprised that we have not seen more bug reports for script reordering.
This kind of bug cannot happen in my "collv2" code because there are no continuations there. That complication was one of the early goals and reasons to start a rewrite.
Thank you for your support^^
UnicodeBot July 1, 2018 at 12:09 AM
Trac Comment 5 by @Markus Scherer—2013-09-13T03:09:04.706Z
This was a fairly significant bug that had nothing to do with Korean, it just happened to manifest with the given script reordering and input strings. The bug was that in a couple of code paths isContinuation(CE) was tested after masking the continuation bits (among others) off of the CE value.
It showed up in this scenario because in the current code we get continuation CEs from numeric collation, and the difference was in there. They got incorrectly script-reordered (the permutation must be done only on the non-continuation lead byte).
The bug should show up just as well in certain differences among vanilla collation elements with 3-byte or 4-byte primaries. 3-byters are very common in recent versions of the root collation data. I am surprised that we have not seen more bug reports for script reordering.
This kind of bug cannot happen in my "collv2" code because there are no continuations there. That complication was one of the early goals and reasons to start a rewrite.
UnicodeBot July 1, 2018 at 12:09 AM
Trac Comment 4 by enh@1d5920f4b44b27a8—2013-08-26T17:01:57.391Z
UnicodeBot July 1, 2018 at 12:09 AM
Trac Comment 3 by muzbit@f74d39fa044aa309—2013-07-11T05:43:29.627Z
I tested this issue using ucol_strcollUTF8() in Android AOSP Sources.
But the result was incorrect.
[preconditions'''|'''Test]
Base Environment : Android AOSP Image (Emulator)
ICU Version : AOSP's ICU Version is 51
Locale Resource File : ko_kr.txt [Hang '''Hani'''|reorder] (it is Android Native)
Modified Android Source Code
external\sqlite\android\sqlite3_android.cpp
1. In collate8() Functionstatic int collate8(void *p, int n1, const void *v1, int n2, const void *v2) { UCollator *coll = (UCollator *) p; // UCharIterator i1, i2; //Added Code UErrorCode status = U_ZERO_ERROR; // uiter_setUTF8(&i1, (const char *) v1, n1); // uiter_setUTF8(&i2, (const char *) v2, n2); //UCollationResult result = ucol_strcollIter(coll, &i1, &i2, &status); //Added Code UCollationResult result = ucol_strcollUTF8(coll,(const char *)v1,n1,(const char *)v2,n2,&status); if (U_FAILURE(status)) { // ALOGE("Collation iterator error: %d\n", status); } ...
2. In register_localized_collators() Function
//Added Code_S: set ascending sort for digits in SQL #define LOCALIZED_NUM_COLLATOR_NAME "LOCALIZED_NUM" //Added Code_E extern "C" int register_localized_collators(sqlite3* handle, const char* systemLocale, int utf16Storage) { ... } else { err = sqlite3_create_collation_v2(handle, PHONEBOOK_COLLATOR_NAME, SQLITE_UTF8, collator, collate8, (void(*)(void*))localized_collator_dtor); } if (err != SQLITE_OK) { return err; } //// PHONEBOOK_COLLATOR //Added Code_S : set ascending sort for digits in SQL status = U_ZERO_ERROR; collator = ucol_open(systemLocale, &status); if (U_FAILURE(status)) { ALOGE("Locale set error"); return -1; } ucol_setAttribute(collator, UCOL_STRENGTH, UCOL_PRIMARY, &status); if (U_FAILURE(status)) { return -1; } ucol_setAttribute(collator, UCOL_NUMERIC_COLLATION, UCOL_ON, &status); if (U_FAILURE(status)) { return -1; } status = U_ZERO_ERROR; //ucol_getShortDefinitionString(collator, NULL, buf, 1024, &status); if (utf16Storage) { err = sqlite3_create_collation_v2(handle, LOCALIZED_NUM_COLLATOR_NAME, SQLITE_UTF16, collator, collate16, (void(*)(void*))localized_collator_dtor); } else { err = sqlite3_create_collation_v2(handle, LOCALIZED_NUM_COLLATOR_NAME, SQLITE_UTF8, collator, collate8, (void(*)(void*))localized_collator_dtor); } if (err != SQLITE_OK) { return err; } //Added Code_E
framework/base/core/java/android/database/sqlite/SQLiteConnection.java
3. In setLocaleFromConfiguration() Functionprivate void setLocaleFromConfiguration() { ... try { execute("DELETE FROM android_metadata", null, null); execute("INSERT INTO android_metadata (locale) VALUES(?)", new Object[] { newLocale }, null); execute("REINDEX LOCALIZED", null, null); //Added Code_S:set ascending sort for digits in SQL execute("REINDEX LOCALIZED_NUM", null, null); //Added Code_E success = true; } finally { execute(success ? "COMMIT" : "ROLLBACK", null, null); }
[Scenario'''|'''Test]
1. Build modified source file & change output image & boot emulator
2. In Emulator Change Locale : Settings--> Language&input --> Language --> 한국어
3. Run Test App
a. create test database
b. insert 0~99 numberContentValues values = new ContentValues(); for(int i =0 ; i< 100; i++) { values.put("data", i); db.insert(TABLE_NAME, null, values); }
c. Query numeric sort order and native localized sort order
//"data COLLATE LOCALIZED_NUM ASC for numeric sort" Cursor cur = getContentResolver().query(Uri.parse("content://com.muzbit.collate_test"),new String[] {"data"}, null, null, "data COLLATE LOCALIZED_NUM ASC" ); //"data COLLATE LOCALIZED ASC for native localized sort" Cursor cur2 = getContentResolver().query(Uri.parse("content://com.muzbit.collate_test"),new String[] {"data"}, null, null, "data COLLATE LOCALIZED ASC" );
[Result'''|'''Tese]
07-10 21:16:40.976: D/muzbit(1047): LOCALIZED_NUM = 0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/'''16/59/60/61/62/63/64/65/66/67/68/69/70/71/72/73/74/75/76/77/78/79/80/81/82/83/84/85/86/87/88/89/90/91/92/93/94/95/96/97/98/99/17/'''18/19/20/21/22/23/24/25/26/27/28/29/30/31/32/33/34/35/36/37/38/39/40/41/42/43/44/45/46/47/48/49/50/51/52/53/54/55/56/57/58/
07-10 21:16:41.146: D/muzbit(1047): LOCALIZED = 0/1/10/11/12/13/14/15/16/17/18/19/2/20/21/22/23/24/25/26/27/28/29/3/30/31/32/33/34/35/36/37/38/39/4/40/41/42/43/44/45/46/47/48/49/5/50/51/52/53/54/55/56/57/58/59/6/60/61/62/63/64/65/66/67/68/69/7/70/71/72/73/74/75/76/77/78/79/8/80/81/82/83/84/85/86/87/88/89/9/90/91/92/93/94/95/96/97/98/99/
Please review this result.
Thank you.
UnicodeBot July 1, 2018 at 12:09 AM
Trac Comment 1.2 by muzbit@f74d39fa044aa309—2013-06-17T01:26:26.809Z
Replying to (Comment 1 markus):
This seems to work fine in the Locale Explorer collation demo, but that does not use ucol_strcollIter(). Need to write test code and step in with a debugger.
May i know progress of this issue?
See 2013-feb-06 icu-support thread "wrong result in ucol_strcollIter() when collation reorder rules is [Hang Hani|reorder]." (http://sourceforge.net/mailarchive/message.php?msg_id=30453966)
Android ICU 4.8.1.1
ko_KR modified to include `[Hang Hani|reorder]`
strength=primary
numeric=on
ucol_strcollIter() for UTF-8 input
Wrong result: "40" > "72"
But, it works right when `[Hang|reorder]` or when numeric=off.