uloc_getCharacterOrientation should consider script

Description

anab pointed out the inconsistent result in https://github.com/tc39/proposal-intl-locale-info/issues/88

uloc_isRightToLeft considers the script: it determines its return value from the script after applying likely subtags. uloc_getCharacterOrientation, however, ignores the script.

Here is some test code:

diff --git a/icu4c/source/test/cintltst/cloctst.c b/icu4c/source/test/cintltst/cloctst.c
index 8dbf04572da..c0b43cdfcb5 100644
--- a/icu4c/source/test/cintltst/cloctst.c
+++ b/icu4c/source/test/cintltst/cloctst.c
@@ -3484,7 +3484,13 @@ static void TestOrientation(void)
         { "ar", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
         { "aR", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
         { "ar_Arab", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
+        { "en_Arab", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
+        { "en_Hebr", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
         { "fa", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
+        { "fa_Cyrl", ULOC_LAYOUT_LTR, ULOC_LAYOUT_TTB },
+        { "tr", ULOC_LAYOUT_LTR, ULOC_LAYOUT_TTB },
+        { "tr_Latn", ULOC_LAYOUT_LTR, ULOC_LAYOUT_TTB },
+        { "tr_Arab", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
         { "Fa", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
         { "he", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
         { "ps", ULOC_LAYOUT_RTL, ULOC_LAYOUT_TTB },
@@ -3500,6 +3506,7 @@ static void TestOrientation(void)
         const char* const localeId = toTest[i].localeId;
         const ULayoutType co = uloc_getCharacterOrientation(localeId, &statusCO);
         const ULayoutType expectedCO = toTest[i].character;
+        UBool cRightToLeft = uloc_isRightToLeft(localeId);
         const ULayoutType lo = uloc_getLineOrientation(localeId, &statusLO);
         const ULayoutType expectedLO = toTest[i].line;
         if (U_FAILURE(statusCO)) {
@@ -3508,12 +3515,22 @@ static void TestOrientation(void)
                 localeId,
                 u_errorName(statusCO));
         }
-        else if (co != expectedCO) {
+        else {
+          if (co != expectedCO) {
             log_err(
                 " unexpected result for uloc_getCharacterOrientation(), with localeId \"%s\". Expected %s but got result %s\n",
                 localeId,
                 ULayoutTypeToString(expectedCO),
                 ULayoutTypeToString(co));
+          }
+          if (cRightToLeft != (co == ULOC_LAYOUT_RTL)) {
+            log_err(
+                " inconsistent result between uloc_getCharacterOrientation() and uloc_isRightToLeft, with localeId \"%s\". uloc_getCharacterOrientation return %s but got uloc_isRightToLeft return %s\n",
+                localeId,
+                ULayoutTypeToString(co),
+                cRightToLeft ? "True" : "False"
+            );
+          }
         }
         if (U_FAILURE(statusLO)) {
             log_err_status(statusLO,

 

and the result

LD_LIBRARY_PATH=../../lib:../../stubdata:../../tools/ctestfw:$LD_LIBRARY_PATH ./cintltst /tsutil/cloctst/TestOrientation
Default locale for this run is en_US
Selecting subtree '/tsutil/cloctst/TestOrientation'
TestOrientation {
!! unexpected result for uloc_getCharacterOrientation(), with localeId "en_Arab". Expected ULOC_LAYOUT_RTL but got result ULOC_LAYOUT_LTR
!! inconsistent result between uloc_getCharacterOrientation() and uloc_isRightToLeft, with localeId "en_Arab". uloc_getCharacterOrientation return ULOC_LAYOUT_LTR but got uloc_isRightToLeft return True
!! unexpected result for uloc_getCharacterOrientation(), with localeId "en_Hebr". Expected ULOC_LAYOUT_RTL but got result ULOC_LAYOUT_LTR
!! inconsistent result between uloc_getCharacterOrientation() and uloc_isRightToLeft, with localeId "en_Hebr". uloc_getCharacterOrientation return ULOC_LAYOUT_LTR but got uloc_isRightToLeft return True
!! unexpected result for uloc_getCharacterOrientation(), with localeId "tr_Arab". Expected ULOC_LAYOUT_RTL but got result ULOC_LAYOUT_LTR
!! inconsistent result between uloc_getCharacterOrientation() and uloc_isRightToLeft, with localeId "tr_Arab". uloc_getCharacterOrientation return ULOC_LAYOUT_LTR but got uloc_isRightToLeft return True
} ---[6 ERRORS in TestOrientation] (6ms)
(6ms)
SUMMARY:
******* [Total error count: 6]

 

It seems uloc_getCharacterOrientation is getting the information from CLDR

<layout><orientation><characterOrder>

but <characterOrder> is encoded per locale, without a script override,

so for the case of

en-Arab

en-Hebr

tr-Arab

even though these locales are written in the Arab or Hebr script, uloc_getCharacterOrientation still returns the fallback result from the en or tr locale, which is left-to-right instead of right-to-left.

 

I wonder why <layout><orientation><characterOrder> is encoded per locale instead of per script.

The implementation of uloc_isRightToLeft uses the script subtag. If the locale has none, it first adds likely subtags to obtain one, and then determines the value with uscript_isRightToLeft.
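
The script-first lookup described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not ICU's actual code: the functions sketch_likely_script, sketch_script_is_rtl, and sketch_is_right_to_left are hypothetical names, and the two small tables are hand-written stand-ins for the data that uloc_addLikelySubtags and uscript_isRightToLeft would consult. It assumes canonically cased IDs such as "en_Arab" or "tr-Arab".

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for likely-subtags data: language -> default script.
   Real ICU code would call uloc_addLikelySubtags() instead. */
static const char *sketch_likely_script(const char *lang) {
    static const char *const table[][2] = {
        {"ar", "Arab"}, {"fa", "Arab"}, {"ps", "Arab"}, {"he", "Hebr"},
        {"en", "Latn"}, {"tr", "Latn"},
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(lang, table[i][0]) == 0) {
            return table[i][1];
        }
    }
    return "Zzzz"; /* script unknown */
}

/* Hypothetical stand-in for uscript_isRightToLeft(); lists only two scripts. */
static bool sketch_script_is_rtl(const char *script) {
    return strcmp(script, "Arab") == 0 || strcmp(script, "Hebr") == 0;
}

/* Script-first direction lookup for IDs shaped like "lang" or "lang_Scrp...". */
static bool sketch_is_right_to_left(const char *locale_id) {
    const char *sep = strpbrk(locale_id, "_-");

    /* Step 1: use an explicit four-letter script subtag if one follows the language. */
    if (sep != NULL && strlen(sep + 1) >= 4) {
        bool four_letters = true;
        for (int i = 1; i <= 4; i++) {
            if (!isalpha((unsigned char)sep[i])) {
                four_letters = false;
            }
        }
        char next = sep[5]; /* character right after the four candidate letters */
        if (four_letters && (next == '\0' || next == '_' || next == '-')) {
            char script[5];
            memcpy(script, sep + 1, 4);
            script[4] = '\0';
            return sketch_script_is_rtl(script);
        }
    }

    /* Step 2: no script subtag, so derive the likely script from the language. */
    char lang[9] = {0};
    size_t len = sep != NULL ? (size_t)(sep - locale_id) : strlen(locale_id);
    if (len >= sizeof lang) {
        len = sizeof lang - 1;
    }
    memcpy(lang, locale_id, len);
    return sketch_script_is_rtl(sketch_likely_script(lang));
}
```

With these stand-in tables, sketch_is_right_to_left("en_Arab") and sketch_is_right_to_left("tr-Arab") report right-to-left, while "fa_Cyrl" and "tr" report left-to-right, which matches the expectations added to TestOrientation.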

Activity

Markus Scherer 
March 20, 2025 at 4:59 PM

Recommend isRightToLeft().

Rich: We could change getCharacterOrientation() so that when it falls back too far (just to root??) it could call isRightToLeft().
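
One possible reading of Rich's suggestion, as a rough sketch only: consult the locale's own characterOrder data first, and defer to a script-based check when nothing more specific than root is found. Everything here is hypothetical (the SketchLayout enum, the kOwnOrder table, sketch_character_orientation, and the crude demo_is_rtl callback that stands in for uloc_isRightToLeft), and real resource fallback is more involved than the exact-match lookup used below.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { SKETCH_LTR, SKETCH_RTL } SketchLayout;

/* Callback standing in for uloc_isRightToLeft(). */
typedef bool (*SketchIsRtlFn)(const char *locale_id);

typedef struct {
    const char *locale_id;
    SketchLayout order;
} SketchOrderEntry;

/* Hypothetical table of locales that carry their own characterOrder value. */
static const SketchOrderEntry kOwnOrder[] = {
    {"ar", SKETCH_RTL}, {"fa", SKETCH_RTL}, {"he", SKETCH_RTL}, {"ps", SKETCH_RTL},
};

/* Crude demo callback: treats any ID mentioning Arab or Hebr as right-to-left. */
static bool demo_is_rtl(const char *locale_id) {
    return strstr(locale_id, "Arab") != NULL || strstr(locale_id, "Hebr") != NULL;
}

static SketchLayout sketch_character_orientation(const char *locale_id,
                                                 SketchIsRtlFn is_rtl) {
    /* Use the locale's own data when it exists. */
    for (size_t i = 0; i < sizeof kOwnOrder / sizeof kOwnOrder[0]; i++) {
        if (strcmp(locale_id, kOwnOrder[i].locale_id) == 0) {
            return kOwnOrder[i].order;
        }
    }
    /* Otherwise the lookup would have fallen back to root: ask the script check. */
    return is_rtl(locale_id) ? SKETCH_RTL : SKETCH_LTR;
}
```

Under these assumptions, sketch_character_orientation("en_Arab", demo_is_rtl) yields SKETCH_RTL and sketch_character_orientation("fa_Cyrl", demo_is_rtl) yields SKETCH_LTR, so the two functions would no longer disagree for the locales in the test above.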

Frank Yung-Fong Tang 
January 23, 2025 at 7:44 PM
(edited)

Also, be aware that some scripts have an UNKNOWN direction:

https://github.com/unicode-org/cldr/blob/main/common/properties/scriptMetadata.txt#L58C1-L59C1

Zyyy; 1; 0040; ZZ; -1; RECOMMENDED; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN
...
Zinh; 24; 030F; ZZ; -1; RECOMMENDED; UNKNOWN; UNKNOWN; MIN; UNKNOWN; UNKNOWN
...
Zzzz; 31; FDD0; ZZ; -1; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN
...
Brai; 33; 280E; FR; -1; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN; UNKNOWN

Mark Davis 
January 16, 2025 at 9:00 PM

My first thought was that there was a missing orientation element, eg:

<characterOrder>right-to-left</characterOrder>

However, if someone is getting the orientation from a locale that does not exist in CLDR, like en_Arab, then you’d think it would be derived from the script instead.

However, if there is no data for en_Arab in CLDR, then it would inherit from en, and the i18n data would be left-to-right. So the best choice is unclear to me now.

Rich Gillam 
January 16, 2025 at 5:26 PM

Mark thinks this may be a bug in CLDR. Do you agree? Should we move it over there?

Markus Scherer 
December 19, 2024 at 6:35 PM

LDML: <!ELEMENT orientation ( characterOrder*, lineOrder*, special* ) > https://www.unicode.org/reports/tr35/tr35-general.html#Layout_Elements

I don’t actually see a distinction between vertical text where columns go right-to-left (CJK) vs. columns go left-to-right (Mongolian). Is this not relevant for CLDR/ICU?

LDML: https://www.unicode.org/reports/tr35/#Script_Metadata → CLDR https://github.com/unicode-org/cldr/blob/main/common/properties/scriptMetadata.txt

This just has a boolean for LTR vs. RTL.

Details

Created December 18, 2024 at 7:44 PM
Updated March 20, 2025 at 4:59 PM