Fall back to grapheme boundaries when no dictionary is available

Description

From some previous doc:

Consider falling back to grapheme cluster iteration for breaking when the word dictionary is not present.
This case matters because data slicing may have removed the relevant dictionary file.
The fallback will have a big impact if dictionary files can be omitted.

We need to figure out a framework for writing unit tests that exercise this behavior.

Activity

Frank Yung-Fong Tang
September 23, 2020, 6:36 PM

Move to future.

Frank Yung-Fong Tang
May 9, 2020, 6:07 AM

I tried the following patch to use the grapheme break iterator in the UnhandledEngine, but it won’t work well. The problem is that some Hiragana/Katakana marks are also handled by UnhandledEngine now. More work is needed to keep those characters from reaching that code path.

Frank Yung-Fong Tang
May 9, 2020, 2:30 AM

What we can do: for text that has the dictionary bit on, if no dictionary engine covers it, apply grapheme breaks.
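
A minimal sketch of that selection logic, in Python rather than ICU code. Everything named here is an assumption for illustration: `break_dictionary_run`, the `(handles, segment)` engine pairs, and the toy Thai engine are hypothetical, and `grapheme_boundaries` is a simplified stand-in (break before every non-mark character), not full UAX #29 grapheme segmentation.

```python
import unicodedata

MARK_CATEGORIES = {"Mn", "Mc", "Me"}


def grapheme_boundaries(text):
    # Simplified stand-in for a real grapheme break iterator (NOT full UAX #29):
    # a boundary precedes every character that is not a combining mark.
    bounds = [0]
    for i, ch in enumerate(text):
        if i > 0 and unicodedata.category(ch) not in MARK_CATEGORIES:
            bounds.append(i)
    if text:
        bounds.append(len(text))
    return bounds


def break_dictionary_run(run, engines):
    # `run` is assumed to be a stretch of text whose characters all carry
    # the dictionary bit. `engines` is a hypothetical list of
    # (handles, segment) pairs, one per available dictionary engine.
    for handles, segment in engines:
        if run and handles(run[0]):
            return segment(run)
    # No dictionary engine covers this run: fall back to grapheme boundaries.
    return grapheme_boundaries(run)


# Hypothetical Thai engine. A real engine would consult its dictionary;
# this toy version treats the whole run as one word.
thai_engine = (lambda ch: "\u0e00" <= ch <= "\u0e7f", lambda run: [0, len(run)])

# Tai Tham: letter, vowel sign (a combining mark), letter.
tai_tham = "\u1a20\u1a63\u1a23"
print(break_dictionary_run(tai_tham, [thai_engine]))
```

With only the toy Thai engine registered, the Tai Tham run finds no matching engine and takes the grapheme path, while a Thai run is still handed to its engine.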

To do this work, we can use languages that are written in a complex script but do not yet have a dictionary, for example TAI THAM.

Here is some TAI THAM text we can use for test cases:

https://r12a.github.io/scripts/taitham/

We can write the test cases so that the line breaks and word breaks must match the grapheme breaks, as long as the test text contains no SPACE or other text.
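
One possible shape for such a check, sketched in Python. The helper name `check_fallback_matches_graphemes` and the three segmenter callables are hypothetical; in a real test they would wrap the grapheme, word, and line break iterators under test. The sketch rejects only whitespace and leaves it to the test author to keep text from other scripts out of the samples.

```python
def check_fallback_matches_graphemes(text, grapheme_breaks, word_breaks, line_breaks):
    # Each *_breaks argument is a callable mapping a string to its sorted
    # list of boundary offsets (hypothetical wrappers around real iterators).
    # Returns a list of mismatch descriptions; an empty list means success.
    if any(ch.isspace() for ch in text):
        raise ValueError("test text must not contain whitespace")
    expected = grapheme_breaks(text)
    failures = []
    for name, segmenter in (("word", word_breaks), ("line", line_breaks)):
        actual = segmenter(text)
        if actual != expected:
            failures.append(f"{name}: expected {expected}, got {actual}")
    return failures


# Stub segmenters, used only to show how the helper reports results.
def per_char(text):
    return list(range(len(text) + 1))


def whole_run(text):
    return [0, len(text)]


print(check_fallback_matches_graphemes("abc", per_char, per_char, per_char))
print(check_fallback_matches_graphemes("abc", per_char, whole_run, per_char))
```

Returning a list of failures rather than raising on the first mismatch lets one test run report both the word and the line iterator results for a sample.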

Steven R. Loomis
March 11, 2020, 6:35 PM

Maybe a fallback warning or some info? The dictionary is critical for, say, Thai.

Assignee

Frank Yung-Fong Tang

Reporter

Frank Yung-Fong Tang

Components

Labels

Priority

medium

Time Needed

Days

Fix versions