I'm using ICU's break iterators for both characters and words, as described here. I expect the character break iterator to stop at least as often as the word break iterator, so its break points should be a superset of the word break iterator's. For instance, if I pass abc, I get a, b, and c from the character break iterator and abc from the word break iterator.
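To make that expectation concrete, here is a minimal sketch in plain Python (no ICU involved). The helper name `missing_boundaries` is made up for illustration, and the boundary lists are just the abc example written as offsets, not actual ICU output.

```python
def missing_boundaries(char_breaks, word_breaks):
    """Return the word boundaries that are not also character boundaries.

    An empty result means the character boundaries are a superset
    of the word boundaries, which is what I expect.
    """
    char_set = set(char_breaks)
    return [b for b in word_breaks if b not in char_set]


# Offsets for "abc": characters a|b|c, and a single word abc.
char_breaks = [0, 1, 2, 3]
word_breaks = [0, 3]

# Every word boundary is also a character boundary here.
assert missing_boundaries(char_breaks, word_breaks) == []
```

Any offset this returns would be a place where the word break iterator splits inside something the character break iterator treats as one unit.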
Now I have the Thai string ด้าน้ำ, and the two break iterators behave inconsistently on it. The string is 6 Unicode code points long, and I get these results from ICU 61.1 on macOS:
As you can see, the character break iterator produces the segment [3, 6) (which seems correct), while the word break iterator produces [5, 6), so it breaks at offset 5, which is not a character boundary. Here's a small Python 3 snippet that uses PyICU to reproduce the issue: