u_strToTitle() doesn't return expected result
Description
Activity
Markus Scherer January 13, 2024 at 1:02 AM
“A string consisting only of U+0345 (iota-subscript) is not changed when titlecased with u_strToTitle(). U+0345 is marked as Cased and Changes_When_Titlecased, so I can’t see any reason why ICU is not using the titlecase form.”
As just found, by default ICU string titlecase mapping adjusts the start-of-word index to the next letter/number/symbol and titlecases that character. U+0345 is a combining mark (gc=Mn) so it is skipped.
You should be able to use the U_TITLECASE_ADJUST_TO_CASED or U_TITLECASE_NO_BREAK_ADJUSTMENT options to change this behavior.
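To make the adjustment behavior concrete, here is a toy model in plain C of how the start-of-word index might be chosen under the three behaviors mentioned above. It is only a sketch, not ICU's implementation: the `ToyAdjust` enum, the `toy_title_start()` function, and the three-entry property table are hypothetical names invented for illustration, and the table covers only the code points discussed in this issue.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the three behaviors discussed above.
   These are NOT the ICU constants, only names for this toy model. */
typedef enum {
    TOY_ADJUST_DEFAULT,   /* skip ahead to the first letter/number/symbol */
    TOY_ADJUST_TO_CASED,  /* skip ahead to the first Cased character */
    TOY_NO_ADJUSTMENT     /* titlecase whatever starts the word */
} ToyAdjust;

typedef struct {
    uint32_t cp;
    int is_letter_number_symbol; /* general category in L*, N*, or S* */
    int is_cased;                /* Cased property */
} ToyProps;

/* Illustrative table: only the code points from this issue. */
static const ToyProps TOY_TABLE[] = {
    { 0x0251, 1, 1 }, /* LATIN SMALL LETTER ALPHA: gc=Ll, Cased */
    { 0x0301, 0, 0 }, /* COMBINING ACUTE ACCENT: gc=Mn, not Cased */
    { 0x0345, 0, 1 }, /* COMBINING GREEK YPOGEGRAMMENI: gc=Mn, Cased */
};

static const ToyProps *toy_lookup(uint32_t cp) {
    for (size_t i = 0; i < sizeof TOY_TABLE / sizeof TOY_TABLE[0]; i++) {
        if (TOY_TABLE[i].cp == cp) {
            return &TOY_TABLE[i];
        }
    }
    return NULL; /* unknown to this sketch */
}

/* Returns the index of the code point that would be titlecased,
   or len if the adjustment skips past every code point in the word. */
size_t toy_title_start(const uint32_t *s, size_t len, ToyAdjust mode) {
    if (mode == TOY_NO_ADJUSTMENT) {
        return 0; /* no skipping at all */
    }
    for (size_t i = 0; i < len; i++) {
        const ToyProps *p = toy_lookup(s[i]);
        if (p == NULL) {
            continue; /* assumption: unknown code points are skipped */
        }
        if (mode == TOY_ADJUST_DEFAULT && p->is_letter_number_symbol) {
            return i;
        }
        if (mode == TOY_ADJUST_TO_CASED && p->is_cased) {
            return i;
        }
    }
    return len;
}
```

In this model a word consisting only of U+0345 yields `len` under `TOY_ADJUST_DEFAULT` (nothing is titlecased, matching the report) and index 0 under the other two modes. In real code the corresponding ICU options would be passed through a `UCaseMap` or an equivalent options-taking API rather than `u_strToTitle()` itself.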
Rich Gillam January 11, 2024 at 5:57 PM
says there’s a notation in SpecialCasing.txt dealing with this specific issue, and that it says the string needs to be normalized first (moving the iota subscript to the end) if casing is to work correctly. We don’t believe that u_strToTitle() is documented to normalize before performing case conversion, so that’s something the caller is supposed to do first.
Returning as “Working as designed”, although the ICU-TC is generally thinking that maybe it shouldn’t be designed this way.
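As an illustration of what "normalized first (moving the iota subscript to the end)" means, the sketch below applies only the canonical reordering step of normalization to a code point array. It is a minimal model, not a replacement for a real normalizer: `toy_ccc()` and `toy_canonical_reorder()` are hypothetical helpers, the combining-class table knows only U+0301 (ccc=230) and U+0345 (ccc=240), and no decomposition or composition is performed. A caller of u_strToTitle() would presumably use ICU's normalization API (for example unorm2_normalize()) instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Canonical_Combining_Class for the code points in this issue only.
   Assumption: everything else is treated as a starter (ccc=0), which is
   wrong for other combining marks and is acceptable only in this sketch. */
static uint8_t toy_ccc(uint32_t cp) {
    switch (cp) {
    case 0x0301: return 230; /* COMBINING ACUTE ACCENT */
    case 0x0345: return 240; /* COMBINING GREEK YPOGEGRAMMENI */
    default:     return 0;
    }
}

/* Canonical reordering: within each run of non-starters, sort by
   combining class with a stable insertion sort. Starters never move,
   and marks never move across a starter. */
void toy_canonical_reorder(uint32_t *s, size_t len) {
    for (size_t i = 1; i < len; i++) {
        uint32_t cp = s[i];
        uint8_t cc = toy_ccc(cp);
        if (cc == 0) {
            continue; /* starters stay where they are */
        }
        size_t j = i;
        /* Shift marks with a strictly higher class to the right; the
           loop stops at a starter because ccc 0 is never greater. */
        while (j > 0 && toy_ccc(s[j - 1]) > cc) {
            s[j] = s[j - 1];
            j--;
        }
        s[j] = cp;
    }
}
```

Applied to U+0251, U+0345, U+0301, the reordering produces U+0251, U+0301, U+0345, because the acute accent has the lower combining class. The iota subscript then sits at the end of the sequence, which is the order the SpecialCasing.txt comment assumes.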
jdavis November 10, 2023 at 3:48 PM(edited)
I can’t edit the issue, but I think the above description may be fine. Let me start over:
A string consisting only of U+0345 (iota-subscript) is not changed when titlecased with u_strToTitle(). U+0345 is marked as Cased and Changes_When_Titlecased, so I can’t see any reason why ICU is not using the titlecase form.
I’d expect the string U+0251, U+0345, U+0301 (“ɑ́ͅ” <alpha><iota_subscript><acute>) to be titlecased as U+2C6D, U+0301, U+0399 (“Ɑ́Ι” <ALPHA><acute><IOTA>), as described in SpecialCasing.txt in a comment. But with the root locale and the default break iterator, u_strToTitle() returns U+2C6D, U+0345, U+0301 (“Ɑ́ͅ” <ALPHA><iota_subscript><acute>).