
Inconsistent word and character boundaries

Description

I'm using ICU's break iterators for both characters and words, as described here. I expect the character break iterator to stop more frequently than the word break iterator, so that its break points are a superset of the word break points. For instance, for the input abc, the character break iterator yields a, b, and c, while the word break iterator yields abc.

Now consider the Thai string ด้าน้ำ. The problem is that the behavior of these two break iterators is inconsistent. Given that the length of this string is 6 Unicode code points, I get these results from ICU 61.1 on macOS:

    Word boundaries:
    [0, 5)
    [5, 6)
    Character boundaries:
    [0, 2)
    [2, 3)
    [3, 6)

As you can see, the character break iterator produces [3, 6) as its last segment (which seems correct), while the word break iterator produces [5, 6), so the word boundary at offset 5 falls inside that segment. Here is a small Python 3 script that uses PyICU to reproduce the issue:

    import PyICU


    def wordBreakIterator():
        return PyICU.BreakIterator.createWordInstance(PyICU.Locale("th"))


    def charBreakIterator():
        return PyICU.BreakIterator.createCharacterInstance(PyICU.Locale("th"))


    def printBoundaries(txt, bi):
        bi.setText(txt)
        start = bi.first()
        try:
            while True:
                end = next(bi)
                print("[{}, {})".format(start, end))
                start = end
        except StopIteration:
            pass


    if __name__ == "__main__":
        text = u'ด้าน้ำ'
        print("Word boundaries:")
        printBoundaries(text, wordBreakIterator())
        print("Character boundaries:")
        printBoundaries(text, charBreakIterator())
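The superset expectation can also be checked mechanically, without ICU installed. The following is only a sketch: it hard-codes the segments from the output reported above rather than calling ICU, and the helper name `boundary_offsets` is made up for illustration. It lists the code points of the string and then reports any word boundary that is not also a character boundary.

```python
import unicodedata

# The Thai string from the report; the report states it has 6 code points.
text = "ด้าน้ำ"

# Segments copied by hand from the reported ICU 61.1 output, as
# [start, end) pairs. They are not computed here.
word_segments = [(0, 5), (5, 6)]
char_segments = [(0, 2), (2, 3), (3, 6)]


def boundary_offsets(segments):
    """Collect every start and end offset of the given [start, end) segments.

    Hypothetical helper for this illustration; it is not part of ICU or PyICU.
    """
    offsets = set()
    for start, end in segments:
        offsets.add(start)
        offsets.add(end)
    return offsets


word_offsets = boundary_offsets(word_segments)
char_offsets = boundary_offsets(char_segments)

# Word boundaries that are not also character boundaries. If the character
# boundaries were a superset of the word boundaries, this would be empty.
stray = sorted(word_offsets - char_offsets)

print("code points:", len(text))
for index, ch in enumerate(text):
    # Show each code point with its offset and Unicode name.
    print(index, "U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))

print("word boundaries missing from character boundaries:", stray)
```

With the reported data, offset 5 is the only word boundary that is missing from the character boundaries, which is the inconsistency described in this ticket.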

Assignee: Andy Heninger
Reporter: TracBot
Priority: medium