ICU Word Breaking is Too Useful for Spell-Checking in Scriptio Continua Writing Systems


The Unicode Standard (up to at least version 7.0) documents WJ (U+2060 WORD JOINER) and ZWNBSP (U+FEFF ZERO WIDTH NO-BREAK SPACE) as suppressing word breaks. This is useful for spell-checking in writing systems in which word boundaries cannot be readily detected, such as Thai, Lao, Khmer and Chinese. Belatedly following on from the defeated attempt to remove the word-breaking property of ZWSP (see Javier Sola's account at http://www.unicode.org/mail-arch/unicode-ml/y2009-m01/0604.html), which would have been a far more severe blow, the Unicode editorial committee has declared that WJ and ZWNBSP do not affect word-breaking, and that the text stating that they do is in error. This decision is due to be ratified at the UTC meeting in July 2015.

WJ and ZWNBSP suppress word-breaking in the Thai word-breaker for ICU, contrary to what has been declared to be the correct behaviour. A suitable test string is ไม่รู้อะไร (no invisible characters), which is assigned both a word break and a line break between ไม่รู้ and อะไร. Inserting WJ, ZWNBSP or even CGJ (U+034F COMBINING GRAPHEME JOINER) between these two words suppresses both the word break and the line break. I have not checked ICU line-breaker behaviour for Chinese or Khmer, and have checked the behaviour only in ICU 53. I have been exploiting it in LibreOffice since it switched to using the ICU line-breaker for Thai.
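
For anyone wishing to reproduce this, the following is a minimal sketch in Python that builds the four test inputs described above (the plain string, and the string with WJ, ZWNBSP or CGJ inserted between the two words). The helper names build_cases and word_boundaries are hypothetical and exist only for this illustration. The boundary check assumes the third-party PyICU binding is installed and is skipped if it is not; the report itself was checked through LibreOffice against ICU 53, not through this script, so the output will depend on the ICU version PyICU is linked against.

```python
# Test string from the report: two Thai words with no invisible characters.
FIRST, SECOND = "ไม่รู้", "อะไร"

# Characters reported to suppress the break between the two words in ICU 53.
JOINERS = {
    "WJ": "\u2060",      # WORD JOINER
    "ZWNBSP": "\ufeff",  # ZERO WIDTH NO-BREAK SPACE
    "CGJ": "\u034f",     # COMBINING GRAPHEME JOINER
}


def build_cases():
    """Return {label: text} for the plain string and each joiner variant."""
    cases = {"plain": FIRST + SECOND}
    for label, ch in JOINERS.items():
        cases[label] = FIRST + ch + SECOND
    return cases


def word_boundaries(text):
    """Word-break offsets from ICU via PyICU, or None if PyICU is absent.

    Assumption: PyICU (module name `icu`) is installed. Offsets are in
    UTF-16 code units, which equal Python indices for this all-BMP text.
    """
    try:
        from icu import BreakIterator, Locale
    except ImportError:
        return None
    bi = BreakIterator.createWordInstance(Locale("th"))
    bi.setText(text)
    return list(bi)  # iterating yields each boundary after the start


if __name__ == "__main__":
    for label, text in build_cases().items():
        codepoints = " ".join(f"U+{ord(c):04X}" for c in text)
        print(label, word_boundaries(text), codepoints)
```

A boundary at offset 6 (the end of ไม่รู้) in the plain case, and its absence in the other three cases, would correspond to the behaviour reported here.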

This behaviour should be removed to make spell-checking and correction of documents harder for users of the Chinese, Thai and Khmer scripts. Failure to tackle the issue with the behaviour of CGJ would leave indirect users of ICU word-breaking with a back door to maintain their current capability.

Providing an option to retain the currently documented behaviour would avoid the temptation to fork the ICU word-breaker in order to continue providing better support for scriptio continua writing systems. Presumably such support is contrary to the intentions of the powers that be. (I am following the English legal maxim that a man intends the reasonably foreseeable consequences of his actions.)



Andy Heninger






