I found a case where BreakIterator.getWordInstance does not always return the same boundaries for a given input string. Here is an example:
This prints the following with both ICU4J 59.1 and 61.1.
This problem seems to boil down to RuleBasedBreakIterator.getLanguageBreakEngine, which doesn't return the same engine for the same character on the two runs. The following happens on the first run:
the first Korean characters trigger the addition of the CJK break engine to the list of break engines
then the CJK break engine is selected for "ー"
then "゠" triggers the addition of more characters to the unhandled break engine.
Unfortunately, "ー" is one of the characters added to the unhandled break engine in the last step. Because the unhandled break engine takes precedence over the CJK break engine, it is selected for "ー" on the second run, instead of the CJK engine that was selected the first time.
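To make the ordering issue concrete, here is a toy model of the lookup described above. It is not ICU4J code: the class `EngineSelectionModel`, its `Engine` type, and the rule that "゠" pulls "ー" into the unhandled set are simplifications I made up for illustration. It only assumes what is stated above, namely that the unhandled engine is consulted first, that the CJK engine is added lazily, and that the unhandled engine's character set can grow while text is being processed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: names and rules here are hypothetical, not ICU4J internals.
class EngineSelectionModel {
    static final char HANGUL = '\uD55C';        // a Hangul syllable, stands in for the Korean text
    static final char LONG_VOWEL = '\u30FC';    // the prolonged sound mark
    static final char DOUBLE_HYPHEN = '\u30A0'; // the double hyphen

    // A break engine reduced to a name and the set of characters it claims.
    static final class Engine {
        final String name;
        final Set<Character> handled = new HashSet<>();
        Engine(String name) { this.name = name; }
    }

    private final Engine unhandled = new Engine("unhandled");
    private final List<Engine> engines = new ArrayList<>();

    EngineSelectionModel() {
        // The unhandled engine is first in the list, so it wins whenever it claims a character.
        engines.add(unhandled);
    }

    // Returns the name of the engine selected for c, mutating the engine list as a side effect.
    String engineFor(char c) {
        for (Engine e : engines) {
            if (e.handled.contains(c)) {
                return e.name;
            }
        }
        if (c == HANGUL) {
            // First Korean character: lazily add a CJK engine that also claims the long vowel mark.
            Engine cjk = new Engine("cjk");
            cjk.handled.add(HANGUL);
            cjk.handled.add(LONG_VOWEL);
            engines.add(cjk);
            return cjk.name;
        }
        // No engine claims c: the unhandled engine absorbs it.
        unhandled.handled.add(c);
        if (c == DOUBLE_HYPHEN) {
            // Assumed grouping: absorbing the double hyphen also absorbs the long vowel mark.
            unhandled.handled.add(LONG_VOWEL);
        }
        return unhandled.name;
    }

    // Selects an engine for every character of text, in order.
    List<String> run(String text) {
        List<String> selected = new ArrayList<>();
        for (char c : text.toCharArray()) {
            selected.add(engineFor(c));
        }
        return selected;
    }

    public static void main(String[] args) {
        EngineSelectionModel model = new EngineSelectionModel();
        String text = "\uD55C\u30FC\u30A0"; // placeholder input, not the string from this report
        System.out.println(model.run(text)); // [cjk, cjk, unhandled]
        System.out.println(model.run(text)); // [cjk, unhandled, unhandled]
    }
}
```

The second pass differs from the first only because state left behind by the first pass changes which engine claims the middle character.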
Note that this string doesn't come from real data; it is a substring of a larger string that was generated randomly when running the Lucene test suite.
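For anyone who wants to check other inputs for the same symptom, a minimal sketch of a stability check is below. It collects the boundaries twice from the same iterator instance and compares them. To keep it runnable without extra dependencies it uses the JDK's `java.text.BreakIterator`, which is an assumption on my part and is not expected to show the bug; switching the import to `com.ibm.icu.text.BreakIterator` should exercise ICU4J instead, since it offers the same `getWordInstance`, `setText`, `first`, `next`, and `DONE` members. The class name `BoundaryConsistencyCheck` and the sample text are placeholders.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch of a stability check; class and method names are hypothetical.
class BoundaryConsistencyCheck {
    // Collects every boundary offset the iterator reports for text.
    static List<Integer> boundaries(BreakIterator it, String text) {
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    // Runs the same iterator instance over the same text twice and compares the results.
    static boolean isStable(BreakIterator it, String text) {
        List<Integer> first = boundaries(it, text);
        List<Integer> second = boundaries(it, text);
        return first.equals(second);
    }

    public static void main(String[] args) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        String text = "hello world"; // placeholder input, not the string from this report
        System.out.println(boundaries(it, text));
        System.out.println(isStable(it, text));
    }
}
```

Reusing one iterator instance matters here, because the behavior described above depends on state that accumulates between runs.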