RBBI Rule Size Reductions

Description

Some ideas for reducing the size of break iterator rule files

  • Use bytes rather than 16-bit values in the state table when a byte is enough, which it is for our standard rule types. (The ICU 60 line break table is 59 character classes by 171 states, so this could save about 10 kB.)

  • Remove fluff from the stored rule string: remove extra spaces and unescape \u-escaped non-syntax characters in the rules. Possibly store it as UTF-8.

  • Markus is considering a byte-valued Trie table, which again would be enough for our standard break types.
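
As a rough illustration of the first and second ideas, here is a minimal Python sketch, not ICU's actual data format or code. The function names `pack_state_table` and `utf8_vs_utf16_size` are hypothetical, the table contents are made up, and the sketch ignores any per-row header fields a real state table may carry. It picks the narrowest cell width that fits the table and reproduces the size estimate for a table of 171 states by 59 character classes.

```python
from array import array


def pack_state_table(rows):
    """Pack a state table using the narrowest unsigned cell type that fits.

    rows: list of lists of non-negative ints (next-state indexes).
    Returns (typecode, packed), where typecode is 'B' (8-bit cells)
    or 'H' (16-bit cells). Hypothetical helper, not an ICU API.
    """
    max_value = max((max(row) for row in rows if row), default=0)
    typecode = 'B' if max_value <= 0xFF else 'H'
    packed = array(typecode)
    for row in rows:
        packed.extend(row)
    return typecode, packed


def utf8_vs_utf16_size(rule_text):
    """Return (UTF-8 byte count, UTF-16 byte count) for a rule string."""
    return len(rule_text.encode('utf-8')), len(rule_text.encode('utf-16-le'))


# Made-up table with the dimensions quoted above: 171 states x 59 classes.
# Every cell is a state index below 171, so each one fits in a byte.
NUM_STATES, NUM_CLASSES = 171, 59
table = [[(s + c) % NUM_STATES for c in range(NUM_CLASSES)]
         for s in range(NUM_STATES)]

typecode, packed = pack_state_table(table)
bytes_8bit = len(packed) * packed.itemsize      # one byte per cell
bytes_16bit = len(table) * NUM_CLASSES * 2      # two bytes per cell
savings = bytes_16bit - bytes_8bit              # 59 * 171 = 10089 bytes
```

With these dimensions the saving is 59 × 171 = 10089 bytes, which matches the "about 10 kB" estimate. For a rule string that is mostly ASCII, `utf8_vs_utf16_size` shows UTF-8 taking roughly half the space of UTF-16.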

Activity

Shane Carr
April 8, 2020, 6:50 AM

Nice!

Frank Yung-Fong Tang
April 13, 2020, 5:34 PM

 

Markus Scherer
May 27, 2020, 7:17 PM

I suggest we keep using this ticket for follow-up PRs after #1100 unless they have their own, very specific tickets already.

Frank Yung-Fong Tang
September 9, 2020, 6:21 PM

I think we should close this bug for what we already did in 68, since we made a lot of changes in this area in 68. If there are other minor improvements we could make after 68, we should file a different bug for those.

Andy Heninger
September 9, 2020, 7:05 PM

I agree with Frank about closing this bug as fixed, especially since all of the original suggestions are done.

Fixed

Assignee

Frank Yung-Fong Tang

Reporter

Andy Heninger

Components

Labels

Reviewer

None

Priority

major

Time Needed

None

Fix versions
