FilteredBreakIterator not handling suppressions like "e.g." (with dot in the middle)

Description

FilteredBreakIterator does not correctly handle many suppressions that contain a dot in the middle. For example, it finds a break in the middle of the following:

  • "G8 countries e.g. France, Germany"

There are two problems:

  • SimpleFilteredBreakIteratorBuilder::build is not adding all of the necessary entries to the reverse trie. For a suppressions entry like "E.g." it adds the "E." prefix to the reverse trie as a partial match, but suppresses the full entry "E.g." from the reverse trie. The standard (delegate) break iterator finds a break before the F in "E.g. F" and then checks for data in the reverse trie, starting in reverse order with . and g, and fails at the g. What should happen for an entry like "E.g." is:

      • Add "E." to the reverse trie as a partial match

      • Add "E.g." to the reverse trie as a full match

      • If there is another suppressions entry "E." by itself (which there is), suppress adding that to the reverse trie, but do add it to the forward trie.

  • The other problem is that some case variants of such suppression terms are missing: the en data has E.G, E.g., and I.e. but not e.g., I.E. and i.e.
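The reverse-trie failure described in the first problem can be modeled with a small Python sketch. This is a simplified illustration under stated assumptions, not ICU code: the trie is stood in for by a plain dict keyed by reversed strings, and the names reverse_lookup, MATCH and PARTIAL are hypothetical.

```python
MATCH, PARTIAL = "match", "partial"  # hypothetical stand-ins for the trie values


def reverse_lookup(reverse_entries, text, break_pos):
    """Walk backward from a candidate break and return the value of the
    longest reverse-trie entry reached, or None if the walk fails first.

    reverse_entries maps REVERSED suppression strings to MATCH or PARTIAL.
    """
    pos = break_pos
    # Skip the whitespace between the suppression and the next word.
    while pos > 0 and text[pos - 1] == " ":
        pos -= 1
    walked = ""
    best = None
    while pos > 0:
        walked += text[pos - 1]
        # A real trie would fail here when no child node exists.
        if not any(key.startswith(walked) for key in reverse_entries):
            break
        if walked in reverse_entries:
            best = reverse_entries[walked]
        pos -= 1
    return best


text = "E.g. F"
break_pos = text.index("F")  # the delegate reports a break before "F"

# Reported behavior: only the "E." prefix is in the reverse trie.
current = {".E": PARTIAL}
# Proposed behavior: the full entry "E.g." is present as well.
proposed = {".E": PARTIAL, ".g.E": MATCH}

print(reverse_lookup(current, text, break_pos))   # walk fails at "g"
print(reverse_lookup(proposed, text, break_pos))  # full match found
```

With only the partial entry, the backward walk consumes the final "." and then has nowhere to go at "g", so the break is not suppressed. With the full entry added, the walk reaches "E" and reports a full match.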

A sample fix for the first problem is to make separate use of the kSuppressInReverse and kAddToForward bits, which are currently not used independently:
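As a rough sketch of that idea (a simplified Python model, not the ICU patch), the two bits can be tracked per entry and consulted separately when filling the tries. The bit values, the function build_tries, and the dict/set stand-ins for the tries are assumptions made for illustration. The sketch also assumes that an entry with an interior dot gets only the add-to-forward bit, while a standalone entry equal to its prefix gets both bits.

```python
SUPPRESS_IN_REVERSE = 1 << 0  # assumed bit value, for illustration only
ADD_TO_FORWARD = 1 << 1       # assumed bit value, for illustration only
MATCH, PARTIAL = "match", "partial"


def build_tries(suppressions):
    """Return (reverse, forward): reverse maps reversed strings to
    MATCH/PARTIAL, forward is the set of entries checked left to right."""
    flags = {s: 0 for s in suppressions}
    reverse, forward = {}, set()

    for s in suppressions:
        dot = s.find(".")
        if dot == -1 or dot + 1 == len(s):
            continue  # no dot in the middle, nothing special to do
        prefix = s[: dot + 1]
        # The prefix (e.g. "E.") goes into the reverse trie as a partial match.
        reverse[prefix[::-1]] = PARTIAL
        # The full entry goes forward, but is NOT suppressed in reverse.
        flags[s] |= ADD_TO_FORWARD
        # A standalone entry equal to the prefix is kept out of the reverse
        # trie (its slot holds the partial match) and goes forward instead.
        if prefix in flags:
            flags[prefix] |= SUPPRESS_IN_REVERSE | ADD_TO_FORWARD

    for s, f in flags.items():
        if not f & SUPPRESS_IN_REVERSE:
            reverse[s[::-1]] = MATCH
        if f & ADD_TO_FORWARD:
            forward.add(s)
    return reverse, forward


reverse, forward = build_tries(["E.", "E.g.", "Mr."])
print(reverse)
print(forward)
```

Because the bits are tested independently, "E.g." lands in the reverse trie as a full match while "E." stays out of it, and both remain available in the forward trie.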

Activity

TracBot
June 30, 2018, 11:42 PM
Trac Comment 2 by —2016-04-27T18:31:45.264Z

Some tests I added for this and other ss= behavior in the Apple version of ICU include, for en@ss=standard:

Peter Edberg
September 6, 2020, 6:10 PM

Taking this one.

Assignee

Peter Edberg

Reporter

Peter Edberg

Components

Labels

Reviewer

None

Priority

assess

Time Needed

None

Fix versions
