simplify Unicode tools

Description

Deleted Component: tools

The Unicode (UCD/UCA) update process is too complicated, see source/data/unidata/changes.txt. Try to simplify it. For ideas see http://site.icu-project.org/design/props/ppucd

Activity

TracBot
July 1, 2018, 12:21 AM
Trac Comment 2 by —2012-01-11T02:18:02.837Z

Many changes. Before going to the "Review" link consider starting from the wiki:Markus/ReviewTicket8972 "guided tour" page.

Markus Scherer
September 27, 2019, 5:44 PM

Copy of Trac wiki:Markus/ReviewTicket8972, last modified 2012-jan-10:

"Guided tour" for reviewing ticket #8972 -- simplify Unicode tools

Goal

Make it easier to run the ICU Unicode-data update tools. Fewer tools to call fewer times with shorter, simpler command lines. Make tools easier to maintain. Fewer manual steps. Intermediate generated files easier to diff & review.

See the design doc: Preparsed UCD.

Overview

Old/ICU 4.8: Several Unicode tools; each parsed several UCD .txt files and merged the relevant properties and values; most tools wrote either a C source file (for hardcoded values in ICU4C) or a binary data file (for ICU4J), so to get both they had to be invoked twice. genpname (property & value aliases) had to be run first, then ICU rebuilt to pick up the new data, then the Unicode tools rebuilt (separate source tree), then the other tools run.

New/ICU 49: One tool for the UCD (genprops); parses one merged, simple-syntax file; writes all output files in one invocation; simplified command-line arguments take paths to file tree roots rather than a larger number of longer paths; property & value aliases are generated first and then injected live into the merged-file parser to avoid one round of rebuilding.

The new file data/unidata/ppucd.txt has all of the Unicode properties and values that are used or implemented by ICU. Most of it is a list of Unicode code point ranges with a simple syntax (property_name=value) and inheritance (code point -> containing block -> defaults -> null values). It contains the property & value aliases data and lines for ranges of algorithmic character names. It includes ICU-specific properties such as tccc and Case_Sensitive, and unnamed Unicode properties such as Conditional_Case_Mappings and Turkic_Case_Folding.
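The inheritance chain described above (code point -> containing block -> defaults -> null value) can be illustrated with a minimal Python sketch. This is not the ICU parser. The line-type keywords `defaults`, `block` and `cp`, the semicolon-separated layout, the helper names and the sample lines are all assumptions made for illustration; only the `property_name=value` syntax and the inheritance order come from the text.

```python
# Hypothetical sketch of ppucd-style inheritance lookup.
# Assumed line layout: type;start..end;prop=value;prop=value...
# Assumed line types: "defaults", "block", "cp" (most specific last).

def parse_line(line):
    """Parse one line into (line_type, start, end, props)."""
    fields = line.strip().split(";")
    line_type, range_text = fields[0], fields[1]
    start_text, _, end_text = range_text.partition("..")
    start = int(start_text, 16)
    # A single code point has no ".." part, so the range is one item long.
    end = int(end_text, 16) if end_text else start
    props = {}
    for field in fields[2:]:
        name, sep, value = field.partition("=")
        if sep:  # only name=value fields are handled in this sketch
            props[name] = value
    return line_type, start, end, props


def lookup(lines, code_point, prop):
    """Return the value of prop for code_point, or None as the null value."""
    found = {}
    for line in lines:
        line_type, start, end, props = parse_line(line)
        if start <= code_point <= end and prop in props:
            found[line_type] = props[prop]
    # Most specific level wins: code point, then block, then defaults.
    for line_type in ("cp", "block", "defaults"):
        if line_type in found:
            return found[line_type]
    return None


# Illustrative sample data, not copied from the real ppucd.txt.
SAMPLE = [
    "defaults;0000..10FFFF;gc=Cn;bc=L",
    "block;0590..05FF;bc=R",
    "cp;05D0..05EA;gc=Lo",
    "cp;0030..0039;gc=Nd;bc=EN",
]
```

With this sample, `lookup(SAMPLE, 0x05D0, "bc")` falls through to the block line, while `lookup(SAMPLE, 0x0041, "gc")` falls through to the defaults line.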

Preparsed UCD

ppucd.txt is generated by the new tools/.../preparseucd.py. It started from the older ucdcopy.py, which just collected the UCD .txt files (found them, copied them, removed version numbers from filenames) and lightly preprocessed them (stripped inline comments from some files, merged lines in some into ranges).
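The two light preprocessing steps mentioned above, stripping inline comments and merging lines into ranges, might look roughly like the following Python sketch. The function names are hypothetical and the code is not taken from ucdcopy.py or preparseucd.py. It assumes that `#` starts a comment, as in the UCD .txt files.

```python
# Hypothetical sketch of light UCD preprocessing; not the ucdcopy.py code.

def strip_comment(line):
    """Drop a trailing '# ...' comment and trailing whitespace."""
    return line.split("#", 1)[0].rstrip()


def merge_into_ranges(pairs):
    """Merge (code_point, value) pairs, sorted by code point, into ranges.

    Adjacent code points with the same value collapse into one
    (start, end, value) tuple.
    """
    ranges = []
    for code_point, value in pairs:
        last = ranges[-1] if ranges else None
        if last and last[1] + 1 == code_point and last[2] == value:
            last[1] = code_point  # extend the current range by one
        else:
            ranges.append([code_point, code_point, value])
    return [tuple(r) for r in ranges]
```

Merging single code points into ranges keeps the intermediate files short, which is part of what makes them easier to diff and review.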

preparseucd.py does much more than that. After preprocessing, it parses all of the files and stores the properties in an inversion map. It writes ppucd.txt with all of the properties. It writes the header file with the property & value aliases data, which used to be done by a Perl script. It writes the Normalizer2 input .txt files for NFC, NFKC & NFKC_CF, which used to be done by the gennorm tool plus (for NFKC_CF) a manual regex and editing.
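For readers unfamiliar with the term, an inversion map stores a sorted list of range start points together with one value per range. The following is a generic Python sketch of that idea under stated assumptions. The class and method names are hypothetical, and this is not the data structure as written in preparseucd.py.

```python
import bisect

# Hypothetical sketch of an inversion map over code points 0..0x10FFFF.


class InversionMap:
    """starts[i] is the first code point of range i.

    Range i ends just before starts[i + 1]; the last range ends at 0x10FFFF.
    """

    def __init__(self, default=None):
        self.starts = [0]
        self.values = [default]

    def _split(self, code_point):
        """Make sure a range starts at code_point; return that range's index."""
        i = bisect.bisect_right(self.starts, code_point) - 1
        if self.starts[i] == code_point:
            return i
        # Split range i in two; the new upper part keeps the old value.
        self.starts.insert(i + 1, code_point)
        self.values.insert(i + 1, self.values[i])
        return i + 1

    def set_range(self, start, end, value):
        """Set value for all code points in start..end (inclusive)."""
        first = self._split(start)
        limit = self._split(end + 1) if end < 0x10FFFF else len(self.starts)
        # Replace every range inside start..end with a single range.
        self.starts[first:limit] = [start]
        self.values[first:limit] = [value]

    def get(self, code_point):
        """Return the value of the range that contains code_point."""
        i = bisect.bisect_right(self.starts, code_point) - 1
        return self.values[i]
```

This sketch does not merge neighboring ranges that end up with equal values; a production version would probably want to do so to keep the map compact.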

The new script also requires less syntax in the uchar.h header file. See the modified comments in uchar.h.

ppucd.txt parser

New parser for the ppucd.txt file; see ppucd.h and ppucd.cpp.

Copies of UCD .txt files in the ICU source tree

Several of those files are not needed any more because they are parsed by preparseucd.py from a UCD download folder and not by any other code after that. The obsolete .txt files are deleted in changeset r31196.

For the remaining files that are still used by tests, I submitted ticket #9041.

genprops

genprops.cpp: UCD .txt parser replaced by calls to the ppucd parser and calls to the merged-in builders. A new path/to/ICU/src/root command-line argument replaces several options with longer values. See the diffs.

New properties builder interface: see the genprops.h diffs.

Core properties builder recast to fit this interface; see the diffs.

genpname

This tool takes preparsed data from a header file and writes the data files for property names and value names. The preparsed header file used to be generated by a fragile, hard-to-maintain Perl script (genpname/preparse.pl), which has now been replaced by a portion of preparseucd.py (function WritePNamesDataHeader() and the functions it calls).

The generated header file was about three times as long as it needed to be, and adding one property name or value name caused most of the lines to change. It was moved to genprops and changed to a simpler, shorter, stable format. See the diffs.

Builder code merged into genprops; see the diffs.

The runtime property (value) names data is injected into the ppucd.txt parser, avoiding the need to rebuild ICU and the tools before building further UCD data.

genbidi

Separate parser removed. Builder code merged into genprops; see the diffs.

gencase

Separate parser removed. Builder code merged into genprops; see the diffs.

gennames

Separate parser removed. Builder code merged into genprops; see the diffs.

In gennames, for each Unicode version, the CJK ranges had to be manually reviewed, adjusted and/or added. Now, the ranges of algorithmically computed character names are read from ppucd.txt (e.g., algnamesrange;3400..4DB5;han;CJK UNIFIED IDEOGRAPH-). preparseucd.py parses them automatically out of UnicodeData.txt.
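As an illustration of how such a line can drive name generation, here is a small Python sketch that parses the `algnamesrange` example quoted above and derives a character name from it. The function names are hypothetical and this is not the gennames or genprops code. It relies on the fact that a CJK unified ideograph name is the prefix followed by the code point in uppercase hex, and it handles only the `han` type that appears in the example.

```python
# Hypothetical sketch: parse an algnamesrange line and compute a name.

def parse_alg_names_range(line):
    """Parse 'algnamesrange;start..end;type;prefix' into a dict."""
    fields = line.strip().split(";")
    if fields[0] != "algnamesrange":
        raise ValueError("not an algnamesrange line: %r" % line)
    start_text, _, end_text = fields[1].partition("..")
    return {
        "start": int(start_text, 16),
        "end": int(end_text, 16),
        "type": fields[2],
        "prefix": fields[3] if len(fields) > 3 else "",
    }


def han_name(name_range, code_point):
    """Return the algorithmic name, or None if outside the range."""
    if name_range["type"] != "han":
        raise ValueError("only the 'han' type is handled in this sketch")
    if not name_range["start"] <= code_point <= name_range["end"]:
        return None
    # Name = prefix + code point in uppercase hex, at least four digits.
    return "%s%04X" % (name_range["prefix"], code_point)


EXAMPLE = "algnamesrange;3400..4DB5;han;CJK UNIFIED IDEOGRAPH-"
```

Because the range end comes from the data file, a new Unicode version that extends the block needs no code change in this scheme, which is the point of the change described above.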

gennorm

The gennorm tool (what was left of it since the Normalizer2 work) has been subsumed by preparseucd.py (function WriteNorm2() and the functions it calls) which writes the data/unidata/norm2/ files nfc.txt, nfkc.txt and nfkc_cf.txt.

gennorm2

gennorm2 now reads an optional line like "* Unicode 6.1" from the input .txt files. preparseucd.py generates those lines in nfc.txt and nfkc.txt, so that the Unicode version argument can be removed from the gennorm2 command lines. See changeset r31193.
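A rough Python sketch of recognizing such a version line follows. gennorm2 itself is a C++ tool, so this is only an illustration of the idea, not its parser. The function name, the regular expression, the `#` comment handling and the sample mapping line are assumptions.

```python
import re

# Hypothetical sketch: find an optional "* Unicode x.y" line in input text.
# Accepts one to four numeric version fields, for example 6.1 or 6.1.0.
_VERSION_LINE = re.compile(r"^\*\s*Unicode\s+(\d+(?:\.\d+){0,3})$")


def find_unicode_version(lines):
    """Return the version as a tuple of ints, or None if no line declares it."""
    for line in lines:
        # Assumption: '#' starts a comment that runs to the end of the line.
        text = line.split("#", 1)[0].strip()
        match = _VERSION_LINE.match(text)
        if match:
            return tuple(int(part) for part in match.group(1).split("."))
    return None


# Illustrative input; the mapping line is only a placeholder.
SAMPLE_INPUT = [
    "# generated file",
    "* Unicode 6.1",
    "00C0=0041 0300",
]
```

Returning None when the line is absent mirrors the "optional" wording above: a caller could then fall back to a version supplied some other way.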

genuca

Only a minor change: Rather than the source-dir and dest-dir options (and the optional FractionalUCA.txt argument), it now just takes the path/to/ICU/src/root argument like genprops.

After a UCD update and running genprops, we still need to rebuild ICU and the Unicode tools before running genuca so that it picks up the new case mapping properties and NFC normalization etc. Changing this would be "hard". I submitted ticket #9040 for it.

Miscellaneous

There are minor changes in other files. See the "Review" link for ticket #8972. The generated data files changed, but the data is equivalent; the different parsing order just results in the tries getting built differently.

Fixed

Assignee: Markus Scherer

Reporter: Markus Scherer

Components: None

Labels: None

Reviewer: None

Priority: major

Time Needed: Weeks

Fix versions