Strings starting with the letter alif (ا) should not be indexed under hamza (ء) when using both locales ur and ar

General

Trac Info

Description

Hamza (ء) is used in Arabic and Urdu, but a string that starts with the letter alif (ا) should not be indexed under hamza (ء). It should be indexed under alif (ا).

It may be an ICU bug, but it sounds more like a locale issue than an ICU issue, so I reported it here.

Disclaimer: I am not a native speaker of Arabic or Urdu. But apparently, alif (ا) is commonly used in Arabic.

The Arabic collator in ICU puts alif (ا) and hamza (ء) into the same bucket, but the Urdu collator in ICU doesn't. If hamza (ء) should be in a different index bucket, it could be a collation bug in Arabic. Here is the code to reproduce the issue.

GoogleIssue:31034811

Activity


TracBot May 9, 2019 at 9:15 PM

Trac Comment 8 by vichang@1d5920f4b44b27a8—2018-08-28T18:45:59.775Z

Our code in Android is very generic, without specifying any locale except English.

But I guess it's possible to fix this by adding a special case: check the locale and exclude hamza from the Unicode set.

TracBot May 9, 2019 at 9:15 PM

Trac Comment 7 by —2018-08-28T18:26:45.509Z

"Do you think ICU AlphabeticIndex should have a concept of primary locale?"

There is a little bit of that: Only the constructor looks in the collator's data for special buckets for Han characters. addLabels() will not do that, even for a Chinese locale.

addLabels() just adds characters and strings to an unordered set. Later code iterates in code point order, creates buckets, and removes duplicates according to the collator. The hamza code point comes before the one for alef...
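To make the mechanism above concrete, here is a minimal stdlib-only sketch, not ICU's actual implementation. It assumes a hypothetical `primaryKey` map standing in for the collator's primary weights, with hamza and alef given the same key to mimic the Arabic behavior described in this ticket. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified model of label selection: unordered set in, code point order
// iteration, duplicates (primary-equal labels) dropped. Hypothetical names.
final class BucketLabelSketch {
    static final String HAMZA = "\u0621"; // ء
    static final String ALEF = "\u0627";  // ا
    static final String BEH = "\u0628";   // ب

    static List<String> chooseLabels(Set<String> added, Map<String, Integer> primaryKey) {
        // String natural order equals code point order for these BMP characters.
        TreeSet<String> ordered = new TreeSet<>(added);
        Set<Integer> seenKeys = new HashSet<>();
        List<String> labels = new ArrayList<>();
        for (String candidate : ordered) {
            Integer key = primaryKey.get(candidate);
            if (key == null) {
                throw new IllegalArgumentException("no primary key for " + candidate);
            }
            // The first label seen for a primary key wins; later ones are dropped.
            if (seenKeys.add(key)) {
                labels.add(candidate);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Assumed weights: hamza and alef are primary-equal, beh is distinct.
        Map<String, Integer> primaryKey = new HashMap<>();
        primaryKey.put(HAMZA, 1);
        primaryKey.put(ALEF, 1);
        primaryKey.put(BEH, 2);

        // Insertion order does not matter: the set is unordered.
        Set<String> added = new HashSet<>(Arrays.asList(ALEF, BEH, HAMZA));

        // Hamza (U+0621) precedes alef (U+0627), so hamza becomes the label.
        System.out.println(chooseLabels(added, primaryKey));
    }
}
```

In this model, the order in which the caller added the Arabic and Urdu labels has no effect on the outcome, which is why there is no notion of a primary locale.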

Imposing an order would take some work, and there is a chance that it may break some use case while fixing this one.

Could you change the call site to not add Urdu if you already have Arabic, or only add the Urdu index exemplars minus the hamza?
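The second option could look roughly like the following at the call site. This is only a sketch: it uses plain Java sets rather than ICU types, the helper name `withoutHamza` is hypothetical, and the sample exemplars are an illustrative subset, not actual CLDR Urdu data. With ICU4J, the filtered characters would presumably be handed to `addLabels()` as a `UnicodeSet` instead.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical call-site helper: drop hamza from a secondary locale's
// index exemplars before they are added as labels.
final class IndexLabelFilter {
    static final String HAMZA = "\u0621"; // ء

    /** Returns a copy of the exemplars without hamza; the input is not modified. */
    static Set<String> withoutHamza(Set<String> exemplars) {
        Set<String> filtered = new LinkedHashSet<>(exemplars);
        filtered.remove(HAMZA);
        return filtered;
    }

    public static void main(String[] args) {
        // Illustrative subset only, not real CLDR index exemplar data.
        Set<String> urduExemplars = new LinkedHashSet<>(
                Arrays.asList("\u0621", "\u0627", "\u0628", "\u067E"));

        // These are the labels the caller would add after Arabic.
        System.out.println(withoutHamza(urduExemplars));
    }
}
```

The trade-off is that hamza never gets its own bucket, even in a UI where Urdu is the only Arabic-script language in use, so the filter would likely need to be conditional on Arabic also being present.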

More generally, it may help to not add a locale with the same script as an earlier one, but it might actually be useful to have buckets for multiple same-script languages (e.g., French & German). A good index for multilingual use is hard :-/

TracBot May 9, 2019 at 9:15 PM

Trac Comment 6 by vichang@1d5920f4b44b27a8—2018-08-28T17:51:51.988Z

"It is possible that the caller adds index buckets from multiple locales and ends up adding the hamza from something other than Arabic, which might result in the hamza as the bucket label."

Yes. It's the case. Here is the code to reproduce the issue.

IIRC, AlphabeticIndex has no concept of primary or secondary locale. When AlphabeticIndex creates the buckets, hamza (ء) always overrides alif (ا) as the bucket label if they are in the same bucket.

Do you think ICU AlphabeticIndex should have a concept of primary locale?

TracBot May 9, 2019 at 9:15 PM

Trac Comment 5 by —2018-08-27T18:09:41.457Z

I am not sure if there is a bug here. It is common for different languages to have different sort orders while sharing a script.

A few years ago, we rewrote the Arabic sort order based on feedback from IBM Egypt, working with them to clarify and iterate. See the comments in source:trunk/common/collation/ar.xml for details.

We discussed making similar changes for other Arabic-script languages but have not had time to look into it, nor to look for people to work with.

I see that Urdu has a tailoring, and source:trunk/common/collation/ur.xml points to a reference document for that. In the absence of more detailed comments in the tailoring file (and the absence of LRMs to make the rules more readable), it is a bit harder to make out what it tries to do.

Note that if two index exemplar characters compare primary-equal, then only one of them will be used for the index bucket. However, I see the alef but not the hamza in source:trunk/common/main/ar.xml:

It is possible that the caller adds index buckets from multiple locales and ends up adding the hamza from something other than Arabic, which might result in the hamza as the bucket label.

TracBot May 9, 2019 at 9:15 PM

Trac Comment 3 by —2018-06-20T15:53:36.262Z

I looked at Apple overrides. We don't currently have any specific to the Arabic collation. We do have some overrides in the root collations, most of them specific to the search collator, but none specific to Arabic. For the root search collator the changes are:

Also, for all of the root collators that do not already do this, we add specific collations for the England/Scotland/Wales emoji flags:

Details

Components

Labels

google

Priority

assess

Phase

pre-sub

Assignee

Markus Scherer

Reporter

Victor Chang

locale

ur ar

Created January 11, 2019 at 5:16 AM

Updated November 11, 2021 at 5:09 PM
