BreakIterator.getWordInstance doesn't always return the same boundaries

Description

I found a case where BreakIterator.getWordInstance does not return the same boundaries for a given input string when the same iterator is run over it twice. Here is an example:

    String t = "口訣 ーデキモサラェレョ゠イ ゠h";
    BreakIterator it = BreakIterator.getWordInstance(ULocale.ROOT);
    it.setText(t);
    List<Integer> boundaries = new ArrayList<>();
    for (int i = it.next(); i != BreakIterator.DONE; i = it.next()) {
      boundaries.add(i);
    }

    it.setText(t);
    List<Integer> boundaries2 = new ArrayList<>();
    for (int i = it.next(); i != BreakIterator.DONE; i = it.next()) {
      boundaries2.add(i);
    }

    System.out.println(boundaries);
    System.out.println(boundaries2);

This prints the following with both ICU4J 59.1 and 61.1.

    [2, 3, 4, 5, 7, 9, 10, 11, 14, 15, 16, 17]
    [2, 3, 14, 15, 16, 17]
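The expectation that is violated here, namely that two passes over the same text with the same iterator yield identical boundaries, can be written as a small reusable check. The following is a minimal sketch, not part of the original report: the class and method names are hypothetical, and it uses the JDK's java.text.BreakIterator so that it runs without ICU4J on the classpath. Against ICU4J the same loop should apply with com.ibm.icu.text.BreakIterator, as in the example above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; uses the JDK BreakIterator as a stand-in for ICU4J's.
class BoundaryStabilityCheck {

    /** Collects every boundary after the start of the text, as in the report. */
    static List<Integer> boundaries(BreakIterator it, String text) {
        it.setText(text); // resets the iterator to the first boundary (offset 0)
        List<Integer> result = new ArrayList<>();
        for (int i = it.next(); i != BreakIterator.DONE; i = it.next()) {
            result.add(i);
        }
        return result;
    }

    /** True when two passes over the same text with the same iterator agree. */
    static boolean isStable(BreakIterator it, String text) {
        List<Integer> first = boundaries(it, text);
        List<Integer> second = boundaries(it, text);
        return first.equals(second);
    }

    public static void main(String[] args) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        System.out.println(isStable(it, "口訣 ーデキモサラェレョ゠イ ゠h"));
    }
}
```

Reusing one iterator for both passes matters: the report describes state that persists between the two runs, so a fresh iterator per pass might not exercise the same path.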

This problem seems to boil down to RuleBasedBreakIterator.getLanguageBreakEngine, which doesn't return the same engine for the same character during the two runs. The following happens:

  • the first Korean characters ("口訣") trigger the addition of the CJK break engine to the list of break engines

  • then the CJK break engine is selected for "ー"

  • then "゠" triggers the addition of more characters to the unhandled break engine.

Unfortunately, "ー" is one of the characters added to the unhandled break engine in the last step. Because the unhandled break engine takes precedence over the CJK break engine, it is selected for "ー" on the second run, whereas the CJK engine was selected on the first run.
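The order dependence described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not ICU4J's actual code: the class, field, and method names are hypothetical, the unhandled engine is assumed to absorb the whole Unicode script of any character it is given, and the CJK engine is assumed to claim "ー" (U+30FC) explicitly in addition to the Han, Katakana, and Hiragana scripts. It relies on the JDK's Character.UnicodeScript, which classifies both "ー" and "゠" (U+30A0) as COMMON.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: all names are hypothetical and do not mirror ICU4J internals.
class EngineSelectionSketch {

    /** An engine that claims whole scripts plus a few individual code points. */
    static final class Engine {
        final String name;
        final Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        final Set<Integer> extraCodePoints = new HashSet<>();

        Engine(String name) {
            this.name = name;
        }

        boolean handles(int codePoint) {
            return extraCodePoints.contains(codePoint)
                    || scripts.contains(UnicodeScript.of(codePoint));
        }
    }

    private final Engine unhandled = new Engine("unhandled");
    private final Engine cjk = new Engine("CJK");
    // Lookup order is precedence order; the unhandled engine sits first.
    private final List<Engine> engines = new ArrayList<>();

    EngineSelectionSketch() {
        cjk.scripts.addAll(EnumSet.of(
                UnicodeScript.HAN, UnicodeScript.KATAKANA, UnicodeScript.HIRAGANA));
        cjk.extraCodePoints.add(0x30FC); // "ー", assumed to be claimed explicitly
        engines.add(unhandled);
    }

    /** Returns the name of the engine chosen for a code point; may mutate engine state. */
    String select(int codePoint) {
        for (Engine e : engines) {
            if (e.handles(codePoint)) {
                return e.name;
            }
        }
        if (cjk.handles(codePoint)) {
            engines.add(cjk); // registered lazily on the first CJK character
            return cjk.name;
        }
        // No engine fits: the unhandled engine absorbs the character's whole script.
        unhandled.scripts.add(UnicodeScript.of(codePoint));
        return unhandled.name;
    }

    public static void main(String[] args) {
        EngineSelectionSketch s = new EngineSelectionSketch();
        System.out.println(s.select(0x53E3)); // "口": prints CJK
        System.out.println(s.select(0x30FC)); // "ー", first run: prints CJK
        System.out.println(s.select(0x30A0)); // "゠": prints unhandled
        System.out.println(s.select(0x30FC)); // "ー", second run: prints unhandled
    }
}
```

In this model the answer for "ー" changes only because selection both reads and mutates shared state. Consulting the CJK engine before the unhandled one, or never letting the unhandled engine claim characters that another engine already handles, would make the two runs agree; whether either matches the eventual ICU fix is not stated in this report.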

Note that this string doesn't come from real data; it is a substring of a larger string that was generated randomly when running the Lucene test suite.

Assignee: Andy Heninger

Reporter: TracBot

Priority: medium