rewrite data building in python


Design Doc:

Filter file documentation:

  • for 'prebuilt data' (packaged icu-src.tgz/zip, reading from icu/source/data/in/*.dat) - don't use makefiles at all for building data. Put the "swap endianness and build a DLL/etc" default behavior into tools.

  • for 'build from source' data: prereq Python; rewrite the build for Win/Posix and for test data to all be in Python, and make it more flexible/readable/maintainable


Shane Carr
February 15, 2019, 7:58 PM

I'm taking a pass at these today to get them in before 64. I plan to do three action items.

Action Item: I can do #2, the wildcard match for resource paths.

Action Item: Since Jungshik says #7 is high priority, I can also try to get that to work. The implementation I can think of is to just copy all the input files into another file tree and then drop in the substitutions before processing. I believe #7 covers #6 and #8.

Action Item: I'll see if I can reproduce the code error in #3.
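
A minimal sketch of the copy-then-substitute idea for #7, assuming the substitutions are given as a mapping from a path relative to the input tree to a replacement file. The function name, the mapping shape, and the example paths are hypothetical and are not part of the actual buildtool. It needs Python 3.8+ for dirs_exist_ok.

```python
import shutil
import tempfile
from pathlib import Path


def stage_with_substitutions(src_dir, staging_dir, substitutions):
    """Copy src_dir into staging_dir, then overwrite selected files.

    substitutions maps a path relative to src_dir (hypothetical layout,
    e.g. "brkitr/line.txt") to the replacement file on disk.
    The original tree is never modified.
    """
    src_dir, staging_dir = Path(src_dir), Path(staging_dir)
    # Step 1: copy every input file into a separate file tree.
    shutil.copytree(src_dir, staging_dir, dirs_exist_ok=True)
    # Step 2: drop in the substitutions before any processing happens.
    for rel_path, replacement in substitutions.items():
        target = staging_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(replacement, target)
    return staging_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        src = tmp / "src"
        (src / "brkitr").mkdir(parents=True)
        (src / "brkitr" / "line.txt").write_text("original rules")
        custom = tmp / "custom_line.txt"
        custom.write_text("custom rules")
        out = stage_with_substitutions(
            src, tmp / "staging", {"brkitr/line.txt": custom}
        )
        print((out / "brkitr" / "line.txt").read_text())
```

The rest of the build would then read from the staging tree instead of the source tree, so the cost is one extra copy of the input files per build.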

Seems that doing #1 and #4 via a patch (not part of buildtool) is not controversial.

#5 is resolved I believe.

Jungshik Shin
January 11, 2019, 2:10 AM

#7 and #8 are about the same issue for Chromium.

Jungshik Shin
January 11, 2019, 2:09 AM

Handling #7 for Chromium would be ugly and would shift the cost-benefit balance significantly to the cost side.

Shane Carr
January 10, 2019, 10:48 PM

Andy replies:

5. "The \N{NAME} syntax isn't used in break iterator rules, but is used in tests for break iterator and regex. And, it looks like it's in tests for UnicodeSet and Transliterator also."

8. "Yes. The source rule strings aren't used for anything in the BreakIterator implementation. We do test that building a break iterator from the rule string yields identical binary data to the original, for the standard break iterator types. In ICU4J, this is the most substantial testing that the rule builder gets, since the main break iterator data is imported as binary from the ICU4C build."

Markus replies saying that it is OK to break test code when data is filtered.

Shane Carr
January 10, 2019, 10:34 PM

Let's give file substitution a number #7.

Markus says that:

  • #1, #4, and #7 are perhaps better handled by the user.

  • On #2, these rules have to be processed by genrb, so simpler is better. Maybe just a simple * that matches any single part of a path, with an optional prefix and suffix.

  • On #3, if data fallback is broken for some reason, that should be fixed in the code. This includes root.txt in translit: "we should see if it works for root to contain nothing, rather than an empty RuleBasedTransliteratorIDs table."

  • On #5, "It should be possible to build without It is possible that some files use syntax like \N{LATIN CAPITAL LETTER L} but we could change that to \u004C with a comment where it's obscure. I suspect that Andy might use name syntax in BreakIterator rules."
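
The simple wildcard described for #2 could look roughly like the sketch below. It is written in Python only to illustrate the matching rule; per the comment above, the real rules would have to be processed by genrb. The function name and the example resource paths are hypothetical, and two behaviors are assumptions: a pattern must match the whole path, and each path segment may contain at most one *.

```python
def matches(pattern, path):
    """Return True if a "/"-separated resource path matches a pattern.

    A "*" in a pattern segment matches any run of characters inside that
    one segment, with an optional literal prefix and suffix around it.
    A "*" never crosses a "/". Only the first "*" in a segment is treated
    as a wildcard (assumption: one wildcard per segment).
    """
    pattern_parts = pattern.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    # Assumption: the pattern has to cover the whole path, segment by segment.
    if len(pattern_parts) != len(path_parts):
        return False
    for pat, part in zip(pattern_parts, path_parts):
        if "*" not in pat:
            if pat != part:
                return False
            continue
        prefix, _, suffix = pat.partition("*")
        # Prefix and suffix must not overlap inside the matched segment.
        if len(part) < len(prefix) + len(suffix):
            return False
        if not (part.startswith(prefix) and part.endswith(suffix)):
            return False
    return True


if __name__ == "__main__":
    print(matches("/zoneStrings/*/ls", "/zoneStrings/America:Chicago/ls"))
    print(matches("/Currencies/X*", "/Currencies/USD"))
```

Keeping the wildcard to a single path segment means a matcher needs only string prefix and suffix checks, with no backtracking and no regular expression engine.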

Additional comment; I will assign it a number:

8. "We should really also add an option to omit BreakIterator rule strings. I believe they are similarly unnecessary after building the binary data. ... making BreakIterator work right when one or more dictionaries are omitted will likely be a common request. Somewhere between the filter logic, the BreakIterator code, and the BreakIterator rules, we should look at finding a way to make this easy. Maybe during rule compilation check if some file exists and pick one or the other sub-rule (defining some set of characters for dict breaking vs. UAX #29 behavior). (I didn't run this by anyone yet.)"



Shane Carr


Steven R. Loomis



