runtime Normalizer2 builder

Description

We could move the Normalizer2 builder class from the gennorm2 tool to the runtime library. With that, users could build Normalizer2 data at runtime rather than having to prebuild custom data. We would have to change the builder somewhat, for example to report errors through something other than fprintf(), and design a nice API.

Most convenient would be if we could add the data from an existing Normalizer2 instance (an addAll() similar to UnicodeSet?). The instance would have to be a Normalizer2Impl, or maybe a FilteredNormalizer2 that wraps a Normalizer2Impl. Then neither ICU nor the user would need to carry duplicate data, or a text file and a parser. This would be great when the normalization is only a small modification of standard data: adding or removing a small or otherwise algorithmically derivable set of mappings, and/or filtering the data at build time (rather than runtime-filtering with a FilteredNormalizer2). We could even try to build the nfkc_cf.nrm data at runtime, which saves space but costs initialization time.
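To make the API idea concrete, here is a minimal sketch of what such a builder could look like. Everything in it is hypothetical: MappingBuilderSketch, BuildStatus, setMapping() and removeMapping() do not exist in ICU, a plain std::map stands in for real Normalizer2 data, and BuildStatus is only a placeholder for whatever error mechanism the real API would use. It shows two points from the description: copying data from an existing source with addAll(), and reporting errors through a return value instead of fprintf().

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical status type; a placeholder, not ICU's real error type.
enum class BuildStatus { Ok, InvalidCodePoint };

// Hypothetical runtime builder. No class with this name or API exists in ICU.
class MappingBuilderSketch {
public:
    // Stand-in for normalization data: code point -> mapping string.
    using Mappings = std::map<char32_t, std::u32string>;

    // Copy every mapping from an existing data set, in the spirit of
    // UnicodeSet::addAll(). Existing entries are overridden.
    MappingBuilderSketch &addAll(const Mappings &base) {
        for (const auto &entry : base) {
            mappings_[entry.first] = entry.second;
        }
        return *this;
    }

    // Add or override one mapping. Problems are reported to the caller
    // through the return value rather than printed with fprintf().
    BuildStatus setMapping(char32_t c, const std::u32string &target) {
        if (!isCodePoint(c)) { return BuildStatus::InvalidCodePoint; }
        for (char32_t t : target) {
            if (!isCodePoint(t)) { return BuildStatus::InvalidCodePoint; }
        }
        mappings_[c] = target;
        return BuildStatus::Ok;
    }

    // Remove one mapping; returns true if a mapping was present.
    bool removeMapping(char32_t c) { return mappings_.erase(c) > 0; }

    const Mappings &mappings() const { return mappings_; }

private:
    static bool isCodePoint(char32_t c) { return c <= 0x10FFFF; }

    Mappings mappings_;
};
```

A caller would start from the standard data via addAll(), apply a few setMapping() and removeMapping() calls, and check each returned status before building the final data.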

The problem with adding data from an existing Normalizer2 instance and its .nrm file is recovering decomposition mappings in their original form, when they have been recursively decomposed in the actual data.
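A small illustration of the problem, as a sketch only: the std::map below is a stand-in for the raw mappings in the source data, and decomposeFully() and sampleRawMappings() are hypothetical helpers, not gennorm2 code. The raw mapping for U+01D5 is U+00DC U+0304, but after recursive decomposition only U+0055 U+0308 U+0304 remains, so the original two-character form is no longer directly visible.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for raw decomposition mappings as written in the source data.
using Mappings = std::map<char32_t, std::u32string>;

// Fully decompose one code point by applying the raw mappings recursively.
// Code points without a mapping decompose to themselves.
std::u32string decomposeFully(const Mappings &raw, char32_t c) {
    auto it = raw.find(c);
    if (it == raw.end()) {
        return std::u32string(1, c);
    }
    std::u32string result;
    for (char32_t part : it->second) {
        result += decomposeFully(raw, part);
    }
    return result;
}

// Two canonical decompositions from the Unicode Character Database.
Mappings sampleRawMappings() {
    return {
        {U'\u01D5', U"\u00DC\u0304"},  // U WITH DIAERESIS AND MACRON -> U+00DC U+0304
        {U'\u00DC', U"U\u0308"},       // U WITH DIAERESIS -> U+0055 U+0308
    };
}
```

Calling decomposeFully(sampleRawMappings(), U'\u01D5') yields the three code points U+0055 U+0308 U+0304. Getting back to U+00DC U+0304 requires recomposing the prefix, which is the kind of recovery the list below classifies.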

The following types of mappings are easily recoverable from a .nrm file:

  • a mapping that has not been recursively decomposed

  • a 2-way mapping: recover from composition data

  • a 1-way mapping stored as an algorithmic mapping delta:
    it is stored without recursive decomposition anyway

  • a 1-way mapping that has been recursively decomposed only via 2-way mappings:
    recompose the mapping

The unrecoverable mappings, which should only be 1-way or 2-way mappings that have been recursively decomposed with at least one 1-way mapping, would need to be added to the .nrm data file. If the data is generally small, then maybe we could add this data unconditionally.
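The cases above can be read as a decision procedure. The following sketch is one interpretation of that list, not existing ICU code: StoredMapping, its flags, Recovery and classify() are hypothetical names, and a real implementation would have to derive the flags from the .nrm data. Where the list and the paragraph on unrecoverable mappings overlap for 2-way mappings, the sketch assumes that a recursive decomposition through a 1-way mapping makes the original form unrecoverable.

```cpp
#include <cassert>

// How the original form of a mapping could be obtained (hypothetical names).
enum class Recovery {
    AsStored,             // not recursively decomposed: stored form is the original
    AlgorithmicDelta,     // 1-way mapping stored as a delta, never recursively decomposed
    FromCompositionData,  // 2-way mapping: read it back from the composition data
    Recompose,            // 1-way mapping decomposed only via 2-way mappings
    NeedsExtraData        // original form must be stored additionally in the .nrm file
};

// Hypothetical per-mapping facts that a recovery pass would need to know.
struct StoredMapping {
    bool oneWay;                 // true: 1-way mapping; false: 2-way (round-trip)
    bool storedAsDelta;          // stored as an algorithmic mapping delta
    bool recursivelyDecomposed;  // stored form differs from the original form
    bool recursionUsedOneWay;    // recursion went through at least one 1-way mapping
};

// Classify a mapping according to one reading of the cases listed above.
Recovery classify(const StoredMapping &m) {
    if (m.oneWay && m.storedAsDelta) {
        return Recovery::AlgorithmicDelta;
    }
    if (!m.recursivelyDecomposed) {
        return Recovery::AsStored;
    }
    if (m.recursionUsedOneWay) {
        return Recovery::NeedsExtraData;
    }
    return m.oneWay ? Recovery::Recompose : Recovery::FromCompositionData;
}
```

Counting how many mappings in the standard data land in NeedsExtraData would show whether that extra data is small enough to add unconditionally.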

Activity

TracBot
July 1, 2018, 12:03 AM
Trac Comment 1 by —2011-11-18T22:56:50.477Z

Mark wrote in duplicate ticket #8376:

2. In my tooling for Unicode testing, I would like to be able to create the data format at runtime, so that I can modify the results at runtime and rebuild a new Normalizer2. So I'd like to have API like:

Assignee

Markus Scherer

Reporter

Markus Scherer

Components

Labels

None

Reviewer

None

Priority

zero

Time Needed

Weeks

Fix versions

None