uloc_getCharacterOrientation should consider script
Description
Activity
Markus Scherer March 20, 2025 at 4:59 PM
Recommend isRightToLeft().
Rich: We could change getCharacterOrientation() so that when it falls back too far (just to root??) it could call isRightToLeft().
Frank Yung-Fong Tang January 23, 2025 at 7:44 PM(edited)
Also, be aware that some scripts have an UNKNOWN direction:
https://github.com/unicode-org/cldr/blob/main/common/properties/scriptMetadata.txt#L58C1-L59C1
Zyyy; 1; 0040; ZZ; -1; RECOMMENDED; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN
...
Zinh; 24; 030F; ZZ; -1; RECOMMENDED; UNKNOWN; UNKNOWN; MIN; UNKNOWN; UNKNOWN
...
Zzzz; 31; FDD0; ZZ; -1; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN
...
Brai; 33; 280E; FR; -1; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN
Mark Davis January 16, 2025 at 9:00 PM
My first thought was that there was a missing orientation element, eg:
<characterOrder>right-to-left</characterOrder>
However, if someone is getting the orientation from a locale that does not exist in CLDR, like en_Arab, then you’d think it would be derived from the script instead.
However, if there is no data for en_Arab in CLDR, then it would inherit from en, and the i18n data would be left-to-right. So the best choice is unclear to me now.
Rich Gillam January 16, 2025 at 5:26 PM
Mark thinks this may be a bug in CLDR. @Markus Scherer do you agree? Should we move it over there?
Markus Scherer December 19, 2024 at 6:35 PM
LDML: <!ELEMENT orientation ( characterOrder*, lineOrder*, special* ) >
https://www.unicode.org/reports/tr35/tr35-general.html#Layout_Elements
I don’t actually see a distinction between vertical text where columns go right-to-left (CJK) vs. columns go left-to-right (Mongolian). Is this not relevant for CLDR/ICU?
LDML: https://www.unicode.org/reports/tr35/#Script_Metadata → CLDR https://github.com/unicode-org/cldr/blob/main/common/properties/scriptMetadata.txt
This just has a boolean for LTR vs. RTL.

anab pointed out the inconsistent result in https://github.com/tc39/proposal-intl-locale-info/issues/88
Somehow uloc_isRightToLeft considers the script and determines its return value from the script after applying likely subtags, but uloc_getCharacterOrientation ignores the script.
Here is some test code:
diff --git a/icu4c/source/test/cintltst/cloctst.c b/icu4c/source/test/cintltst/cloctst.c
index 8dbf04572da..c0b43cdfcb5 100644
--- a/icu4c/source/test/cintltst/cloctst.c
+++ b/icu4c/source/test/cintltst/cloctst.c
@@ -3484,7 +3484,13 @@ static void TestOrientation(void)
         { "ar", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
         { "aR", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
         { "ar_Arab", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
+        { "en_Arab", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
+        { "en_Hebr", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
         { "fa", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
+        { "fa_Cyrl", ULOC_LAYOUT_LTR, ULOC_LAYOUT_TTB },
+        { "tr", ULOC_LAYOUT_LTR, ULOC_LAYOUT_TTB },
+        { "tr_Latn", ULOC_LAYOUT_LTR, ULOC_LAYOUT_TTB },
+        { "tr_Arab", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
         { "Fa", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
         { "he", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
         { "ps", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
@@ -3500,6 +3506,7 @@ static void TestOrientation(void)
         const char* const localeId = toTest[i].localeId;
         const ULayoutType co = uloc_getCharacterOrientation(localeId, &statusCO);
         const ULayoutType expectedCO = toTest[i].character;
+        UBool cRightToLeft = uloc_isRightToLeft(localeId);
         const ULayoutType lo = uloc_getLineOrientation(localeId, &statusLO);
         const ULayoutType expectedLO = toTest[i].line;
         if (U_FAILURE(statusCO)) {
@@ -3508,12 +3515,22 @@ static void TestOrientation(void)
                 localeId,
                 u_errorName(statusCO));
         }
-        else if (co != expectedCO) {
+        else {
+          if (co != expectedCO) {
             log_err(
                 " unexpected result for uloc_getCharacterOrientation(), with localeId \"%s\". Expected %s but got result %s\n",
                 localeId,
                 ULayoutTypeToString(expectedCO),
                 ULayoutTypeToString(co));
+          }
+          if (cRightToLeft != (co == ULOC_LAYOUT_RTL)) {
+            log_err(
+                " inconsistent result between uloc_getCharacterOrientation() and uloc_isRightToLeft, with localeId \"%s\". uloc_getCharacterOrientation return %s but got uloc_isRightToLeft return %s\n",
+                localeId,
+                ULayoutTypeToString(co),
+                cRightToLeft ? "True" : "False"
+                );
+          }
         }
         if (U_FAILURE(statusLO)) {
             log_err_status(statusLO,
and the result:
LD_LIBRARY_PATH=../../lib:../../stubdata:../../tools/ctestfw:$LD_LIBRARY_PATH ./cintltst /tsutil/cloctst/TestOrientation
Default locale for this run is en_US
Selecting subtree '/tsutil/cloctst/TestOrientation'
TestOrientation {
!! unexpected result for uloc_getCharacterOrientation(), with localeId "en_Arab". Expected ULOC_LAYOUT_RTL but got result ULOC_LAYOUT_LTR
!! inconsistent result between uloc_getCharacterOrientation() and uloc_isRightToLeft, with localeId "en_Arab". uloc_getCharacterOrientation return ULOC_LAYOUT_LTR but got uloc_isRightToLeft return True
!! unexpected result for uloc_getCharacterOrientation(), with localeId "en_Hebr". Expected ULOC_LAYOUT_RTL but got result ULOC_LAYOUT_LTR
!! inconsistent result between uloc_getCharacterOrientation() and uloc_isRightToLeft, with localeId "en_Hebr". uloc_getCharacterOrientation return ULOC_LAYOUT_LTR but got uloc_isRightToLeft return True
!! unexpected result for uloc_getCharacterOrientation(), with localeId "tr_Arab". Expected ULOC_LAYOUT_RTL but got result ULOC_LAYOUT_LTR
!! inconsistent result between uloc_getCharacterOrientation() and uloc_isRightToLeft, with localeId "tr_Arab". uloc_getCharacterOrientation return ULOC_LAYOUT_LTR but got uloc_isRightToLeft return True
} ---[6 ERRORS in TestOrientation] (6ms)
(6ms)
SUMMARY:
******* [Total error count: 6]
It seems uloc_getCharacterOrientation gets its information from the CLDR <layout><orientation><characterOrder> element, but <characterOrder> is encoded per locale, without a script override. So for
en-Arab
en-Hebr
tr-Arab
even though they are written in the Arab or Hebr script, uloc_getCharacterOrientation still returns the fallback result from the en or tr locale, which is left-to-right instead of right-to-left.
I wonder why <layout><orientation><characterOrder> is encoded per locale instead of per script?
The implementation of uloc_isRightToLeft uses the script subtag; if there is none, it first adds likely subtags to get one. It then determines the value based on uscript_isRightToLeft.