'line' break iterator provides incorrect boundaries when evaluating CJ graphemes followed by Ideographic spaces (U+3000)

Description

Let's consider the following case: "ふぇ　" (ふ, then ぇ, then an Ideographic space, U+3000)
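Since the trailing Ideographic space is invisible in most renderings, it may help to spell the test string out with explicit escapes. The following is a minimal Python sketch, assuming the case consists of exactly ふ, ぇ and one U+3000; the names `TEXT` and `describe` are hypothetical helpers, not part of ICU.

```python
import unicodedata

# Test string from this report, written with escapes so that the trailing
# U+3000 is visible (assumption: exactly these three characters).
TEXT = "\u3075\u3047\u3000"


def describe(text):
    """Return (index, code point, Unicode name) for each character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
    ]


for row in describe(TEXT):
    print(row)
```

The indices printed here are the offsets that the boundaries arrays in this report refer to.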

Assume that we have a width of 2 characters, and let's use the "|" symbol to mark the breaking opportunity that is finally selected when rendering the text.

The ICU breaker resolves the following boundaries (fBoundaries array): [0, 1, 3, ...], which ends up producing this result: "ふ|ぇ　"

However, there should be a breaking opportunity after "ぇ" (at least, that is what I get when the line is "ふぇぇ"), which generates a [0, 1, 2, ...] boundaries array and results in "ふぇ|ぇ" when rendering the text with the same width constraints.
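To make the link between the boundaries array and the rendered result concrete, here is a minimal Python sketch. It assumes a greedy renderer that picks the last boundary fitting within the available width and that counts every character as one unit wide; `first_line_break` and `render` are hypothetical helpers for illustration, not ICU functions, and only the leading boundary values listed in this report are used.

```python
def first_line_break(boundaries, width):
    """Pick the last boundary that still fits in `width` characters.

    Assumes a greedy renderer where every character is one unit wide.
    Returns None if no boundary after the start of the text fits.
    """
    candidates = [b for b in boundaries if 0 < b <= width]
    return max(candidates) if candidates else None


def render(text, boundaries, width):
    """Return `text` with '|' inserted at the selected break position."""
    pos = first_line_break(boundaries, width)
    if pos is None or pos >= len(text):
        return text
    return text[:pos] + "|" + text[pos:]


# Case from this report: boundaries [0, 1, 3, ...] with a width of 2.
print(render("\u3075\u3047\u3000", [0, 1, 3], 2))

# Comparison case "ふぇぇ": boundaries [0, 1, 2, ...] with the same width.
print(render("\u3075\u3047\u3047", [0, 1, 2], 2))
```

Under these assumptions, the missing boundary at offset 2 is what forces the break back to offset 1 in the first case.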

Assuming normal line-breaking rules, the Unicode line breaking specification (UAX #14) states that CJ characters should be handled as ID class:

https://www.unicode.org/reports/tr14/#CJ
"Characters of this class may be treated as either NS or ID."

Handling CJ graphemes as ID class should imply that there are breaking opportunities before and after any of these characters. The presence of an Ideographic space (U+3000), which is a BA class character, should not forbid or prevent the breaking opportunity after ぇ from being used.

Hence, I consider that the boundaries array for the original case described in this issue should be [0, 1, 2, ...].

Status

Assignee: Craig Cornelius
Reporter: Javier Fernandez
Labels: None
Reviewer: None
Time Needed: None
Start date: None
Components:
Priority: TBD