FilteredBreakIterator does not correctly handle many suppressions that have a dot in the middle. For example, it finds a break in the middle of the following:
"G8 countries e.g. France, Germany"
There are two problems:
SimpleFilteredBreakIteratorBuilder::build is not adding all of the necessary entries to the reverse trie. For a suppressions entry like "E.g.", it adds the prefix "E." to the reverse trie as a partial match, but suppresses the full entry "E.g." from the reverse trie. The standard (delegate) break iterator finds a break before the F in "E.g. F" and then checks for data in the reverse trie, reading backward starting with "." and "g"; the lookup fails at the "g". What should happen for an entry like "E.g." is:
Add "E." to the reverse trie as a partial match
Add "E.g." to the reverse trie as a full match
If there is another suppressions entry "E." by itself (which there is), suppress adding that to the reverse trie, but do add it to the forward trie.
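To make the failing lookup concrete, here is a toy model of the reverse-trie behavior described above. It is a sketch only, not ICU's code: a std::map stands in for the trie, the names Match, buildReverse, and lookupBackward are hypothetical, and the forward trie is left out entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical result values for this sketch; not ICU's actual constants.
enum class Match { kNone, kPartial, kFull };

// Toy "reverse trie": reversed suppression text -> kind of match.
using ReverseTable = std::map<std::string, Match>;

static std::string reversed(std::string s) {
    std::reverse(s.begin(), s.end());
    return s;
}

// Builds the reverse table. With addFullEntries == false, entries with an
// interior dot contribute only their prefix (the behavior reported above).
// With addFullEntries == true, they are also added as full matches (the
// behavior proposed above).
ReverseTable buildReverse(const std::vector<std::string>& suppressions,
                          bool addFullEntries) {
    ReverseTable table;
    for (const auto& s : suppressions) {
        const std::size_t dot = s.find('.');
        const bool interiorDot =
            dot != std::string::npos && dot + 1 != s.size();
        if (!interiorDot || addFullEntries) {
            table[reversed(s)] = Match::kFull;
        }
    }
    // A prefix such as "E." is stored as a partial match. This overrides a
    // standalone "E." entry, which would then live only in the forward trie.
    for (const auto& s : suppressions) {
        const std::size_t dot = s.find('.');
        if (dot != std::string::npos && dot + 1 != s.size()) {
            table[reversed(s.substr(0, dot + 1))] = Match::kPartial;
        }
    }
    return table;
}

// Reads text backward from index end (one past the last character before
// the candidate break, trailing space already skipped) and returns the
// value of the longest stored key matched along the way.
Match lookupBackward(const ReverseTable& table, const std::string& text,
                     std::size_t end) {
    assert(end <= text.size());
    Match best = Match::kNone;
    std::string key;
    for (std::size_t i = end; i > 0; --i) {
        key.push_back(text[i - 1]);
        const auto it = table.lower_bound(key);
        // Stop as soon as no stored key starts with what has been read.
        if (it == table.end() || it->first.compare(0, key.size(), key) != 0) {
            break;
        }
        if (it->first == key) {
            best = it->second;
        }
    }
    return best;
}
```

With the suppressions {"E.", "E.g."} and the text "E.g. France", lookupBackward(buildReverse(s, false), text, 4) returns Match::kNone, because the walk reads "." and then stops at "g". With addFullEntries set to true it returns Match::kFull, so the break before the F would be suppressed.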
The other problem is that some case variants of such suppression terms are missing; the en data has E.G, E.g., and I.e. but not e.g., I.E. and i.e.
A sample fix for the first problem is to make separate use of the kSuppressInReverse and kAddToForward bits, which are currently not used independently:
Some tests I added for this and other ss= behavior in the Apple version of ICU include, for en@ss=standard:
Taking this one.