ICU4J API Bidi.MAX_EXPLICIT_LEVEL should not be final static

Description

IIUC, Bidi.MAX_EXPLICIT_LEVEL is subject to change due to Unicode algorithm changes. It was changed from 61 to 125 in commit 748e8c9cc6ae9591480f2573ec60ec5fc6d1d0fd.

However, the value is inlined into calling code at compile time, so the library loaded at runtime may have a different value.

2 suggestions:
1. Can we add a static method Bidi.getMaxExplicitLevel in ICU4J?
2. Is some of the API documentation outdated? For example, the Javadoc of Bidi.setPara says "Negative values from -1 to -62 indicate overrides at the absolute value of the level. Positive values from 1 to 62 indicate embeddings.", but also "maximum resolved level can be up to MAX_EXPLICIT_LEVEL+1". It seems 62 is an outdated maximum value.
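
A minimal sketch of the inlining concern and of the first suggestion, assuming a hypothetical stand-in class (`HypotheticalBidi`, not the real ICU4J `Bidi`) and a hypothetical `getMaxExplicitLevel()` accessor. In Java, a `static final` primitive initialized with a literal is a compile-time constant, so the compiler copies its value into the caller. Reading it does not even initialize the declaring class, whereas a static method call is resolved against the class loaded at run time.

```java
final class ConstantInliningSketch {
    // Set by HypotheticalBidi's static initializer, so we can observe
    // whether the "library" class was actually touched at run time.
    static boolean libraryInitialized = false;

    // Hypothetical stand-in for a library class; not the real ICU4J Bidi.
    static final class HypotheticalBidi {
        // Compile-time constant: javac copies 125 into every caller.
        static final byte MAX_EXPLICIT_LEVEL = 125;

        static {
            libraryInitialized = true;
        }

        // Hypothetical accessor along the lines of suggestion 1.
        static byte getMaxExplicitLevel() {
            return MAX_EXPLICIT_LEVEL;
        }
    }

    // Compiled to a literal 125; HypotheticalBidi is not consulted at run time.
    static int readConstant() {
        return HypotheticalBidi.MAX_EXPLICIT_LEVEL;
    }

    // Compiled to a method call; the value comes from whichever
    // HypotheticalBidi class is loaded at run time.
    static int callGetter() {
        return HypotheticalBidi.getMaxExplicitLevel();
    }
}
```

If the library were later rebuilt with a different constant, callers of `readConstant()` would keep the old value until recompiled, while callers of `callGetter()` would see the new one.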

Developers may well consult the specification (Unicode Standard Annex #9) when implementing the bidi algorithm, but fixing the API docs would still be useful for developers on Android, where implementations are based on different Unicode versions.

Activity

Markus Scherer September 20, 2018 at 4:19 PM

> If it really happens in the future, Android needs some implementation in the deprecated API.

Of course, and not just Android. Deprecated ICU API retains functionality but does not extend it.

Victor Chang September 20, 2018 at 11:29 AM

Thanks for following up on this! Just one comment.

> In other words, even if Unicode did increase the max_depth value in the spec, we would not be able to change these parts of the API. We would have to keep them as is, probably deprecate them, and replace with different API.

If it really happens in the future, Android needs some implementation in the deprecated API.

Markus Scherer September 19, 2018 at 6:52 PM

I proposed to the Unicode Technical Committee to provide a guarantee that the UBA max_depth value of 125 will never change.
https://www.unicode.org/L2/L2018/18296-freeze-bidi-max_depth.pdf

The UTC agreed to say in version 12 of the UBA (UAX #9, http://www.unicode.org/reports/tr9/) that the value will never change, but there was no consensus for making this even stronger by adding a stability policy for it. (https://www.unicode.org/policies/stability_policy.html)

However, there are several parts of the API, including LEVEL_DEFAULT_LTR, LEVEL_OVERRIDE, and setPara() taking byte and byte[] values, that cannot work with a MAX_EXPLICIT_LEVEL greater than 125.

In other words, even if Unicode did increase the max_depth value in the spec, we would not be able to change these parts of the API. We would have to keep them as is, probably deprecate them, and replace with different API.
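
As a rough illustration of the byte-width part of this limit, here is a minimal Java sketch (class and method names are hypothetical, not ICU4J API). It only shows that a signed Java `byte` cannot carry a level above 127; the reserved constants mentioned above narrow the usable range further, to the 125 stated in this comment.

```java
final class SignedByteLevelSketch {
    // Narrowing an int level to byte, as any byte or byte[] based API must.
    static byte toLevelByte(int level) {
        return (byte) level;
    }

    // True if the level survives a round trip through a signed Java byte.
    static boolean fitsInByte(int level) {
        return toLevelByte(level) == level;
    }
}
```

For example, `fitsInByte(125)` holds, while a larger value such as 251 wraps around to a negative byte and can no longer be told apart from other values.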

Markus Scherer September 13, 2018 at 4:57 AM

> Will the value of 125 get changed in the future?

Someone could propose doing so, and the UTC might accept it.

However, in 2013 when the UTC had agreed to increase this from 61 to 251 (apparently because the isolates mechanism being added then may increase the number of levels), I pushed back and laid out how 125 would be the maximum that ICU can handle without massively disrupting at least its C API. It also would have caused trouble for Unicode's reference implementation which used signed Java bytes, but that would have been easier to fix.

The UTC agreed and raised max_depth to only 125.

The editors did not add the rationale for 125 into UAX #9, so if someone wants to raise it again (although it has already gone from a "far more than sufficient" 15 to 61 and then 125) and if then no one remembers the ICU limitation, it could be raised and it would cause a lot of pain for ICU.


I discussed this with the UAX #9 editors in email "Lowering bidi's max_depth down to 125" on 2013-may-16 (not on a mailing list). I will just quote snippets from myself for now, except to paraphrase that one reason 125 was accepted was that going higher would have caused problems for any implementation using signed byte values.


... the ICU4C UBiDiLevel is a uint8_t, and something like half of the API takes such values on input and output, including input and output of arrays of these – that is, byte arrays.

Bit seven in the level byte is an "override" flag (UBIDI_LEVEL_OVERRIDE=0x80). That limits us to seven bits.

Also, we have constants UBIDI_DEFAULT_LTR=0xfe and UBIDI_DEFAULT_RTL=0xff. That limits us to 0..125.

For input, where the caller may provide explicit levels, going from 0..61 to 0..125 should work fine.

However, for output, the resolved level can be one higher, currently 62. (See UBA X1, I1, I2.) With a maximum explicit level of 125 we may resolve up to 126.
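
The arithmetic in this snippet can be checked with a small sketch. The three constant values are the ones quoted above for ICU4C, written here as Java ints; the class and method names are hypothetical, and this is not ICU's implementation, only a derivation of the 0..125 and 126 figures under the stated encoding.

```java
final class LevelEncodingSketch {
    // Values as quoted in the comment above (UBIDI_LEVEL_OVERRIDE etc.).
    static final int LEVEL_OVERRIDE = 0x80; // bit 7: override flag
    static final int DEFAULT_LTR = 0xfe;    // reserved marker value
    static final int DEFAULT_RTL = 0xff;    // reserved marker value

    // Highest explicit level whose override-flagged encoding does not
    // collide with either reserved marker value.
    static int maxEncodableExplicitLevel() {
        int max = -1;
        for (int level = 0; level <= 0x7f; level++) {
            int withOverride = level | LEVEL_OVERRIDE;
            if (withOverride == DEFAULT_LTR || withOverride == DEFAULT_RTL) {
                break; // 126 and 127 would be mistaken for the markers
            }
            max = level;
        }
        return max;
    }

    // The resolved level can be one higher than the maximum explicit level.
    static int maxResolvedLevel() {
        return maxEncodableExplicitLevel() + 1;
    }
}
```

Under these assumptions the loop stops at level 126 (since `126 | 0x80` equals `0xfe`), giving a maximum explicit level of 125 and a maximum resolved level of 126, which still fits in the low seven bits.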


I chose the level encoding and designed the API, in 1999, with Unicode 3.0 hot off the press. At the time, 7 bits seemed plenty, because

a) Unicode 3.0 changed the maximum explicit level from 15 to 61, and Mark told me at the time that this was felt to be way more than would be needed. (Compare revisions 2 & 3 of tr9.html)

b) UAX #9 states

Embedding levels are explicitly set by both override format codes and by embedding format codes; higher numbers mean the text is more deeply nested. The reason for having a limitation is to provide a precise stack limit for implementations to guarantee the same results. Sixty-one levels is far more than sufficient for ordering, even with mechanically generated formatting; the display becomes rather muddied with more than a small number of embeddings.

In other words, the UTC has told implementers since 1999 that they can design for explicit levels 0..61 "to provide a precise stack limit", but surely that made it also legitimate to choose data types accordingly.

I find it problematic to go beyond 61, based on the strong language in UAX #9 3.0-6.2. I guess you want to do so because some implementation has found 61 levels not quite "far more than sufficient for ordering, even with mechanically generated formatting", or because you anticipate the introduction of isolates to add to the nesting.

For ICU, 0..125 should work. (Roozbeh can double-check on 125 vs. 124.)

For the Java reference implementation, 0..125 appears to work, too. With more than 7 bits, the reference implementation needs to be changed more extensively.

For other implementations, who knows, they might use 32-bit ints, or it might be painful to go beyond 61.

Victor Chang September 11, 2018 at 1:52 PM

One key question from me: Will the value of 125 get changed in the future?
According to the Unicode spec, it's 125. But it's unclear to me if the value is subject to change in the future.
http://unicode.org/reports/tr9/#BD2

Fixed

Created September 7, 2018 at 12:49 PM
Updated September 21, 2018 at 11:55 PM
Resolved September 21, 2018 at 11:55 PM