Export UCPTrie data to array buffers
Description
Activity
Reopened: Broke the Bazel build for Unicode data files. Fix in https://github.com/unicode-org/icu/pull/1883
Script_Extensions maps a code point to a set of Script values. Storage-wise, you want to optimize for that set to contain a single Script value, and usually the same as the Script property value, because 99+% of code points have Script_Extensions(c) = {Script(c)}.
In ICU data, from the CodePointTrie result I get a 12-bit value for both Script and Script_Extensions. The top 2 bits tell me whether it’s the simple case; if not, the other 10 bits point into an additional array with the Script_Extensions. As a further optimization for the Script value lookup, the top 2 bits also tell me whether the Script is Common or Inherited – if not, then the main Script also comes from the extra array. See https://github.com/unicode-org/icu/blob/main/icu4c/source/common/uprops.h
* Properties in vector word 0
* Bits
* 23..22   3..1: Bits 21..20 & 7..0 = Script_Extensions index
*             3: Script value from Script_Extensions
*             2: Script=Inherited
*             1: Script=Common
*             0: Script=bits 21..20 & 7..0
* 21..20   Bits 9..8 of the UScriptCode, or index to Script_Extensions
*  7.. 0   UScriptCode, or index to Script_Extensions
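To make the bit layout concrete, here is a minimal C sketch of how the 2-bit selector and the 10-bit code-or-index could be pulled out of vector word 0. The macro and function names are hypothetical (they are not the names used in uprops.h), and the assumption that the main Script sits at ext[index] in case 3 is mine, based only on the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the masks follow the quoted bit layout of vector word 0. */
#define SCX_TYPE_SHIFT   22          /* bits 23..22: how to interpret the rest */
#define CODE_HIGH_MASK   0x00300000u /* bits 21..20 hold bits 9..8 of the value */
#define CODE_HIGH_SHIFT  12          /* moves bit 20 down to bit 8 */
#define CODE_LOW_MASK    0x000000ffu /* bits 7..0 hold bits 7..0 of the value */

#define SCRIPT_COMMON    0           /* numeric value of USCRIPT_COMMON */
#define SCRIPT_INHERITED 1           /* numeric value of USCRIPT_INHERITED */

/* Returns the 2-bit selector stored in bits 23..22. */
static uint32_t scx_type(uint32_t word0) {
    return (word0 >> SCX_TYPE_SHIFT) & 0x3u;
}

/* Merges bits 21..20 and 7..0 into one 10-bit script code or array index. */
static uint32_t code_or_index(uint32_t word0) {
    return ((word0 & CODE_HIGH_MASK) >> CODE_HIGH_SHIFT) | (word0 & CODE_LOW_MASK);
}

/*
 * Returns the Script value for a code point's vector word 0.
 * `ext` stands in for the extra Script_Extensions array. Reading the main
 * Script from ext[index] in case 3 is an assumption of this sketch.
 */
static uint32_t get_script(uint32_t word0, const uint16_t *ext, size_t ext_len) {
    uint32_t v = code_or_index(word0);
    switch (scx_type(word0)) {
    case 0:  return v;                 /* simple case: v is the script code */
    case 1:  return SCRIPT_COMMON;     /* v indexes the Script_Extensions */
    case 2:  return SCRIPT_INHERITED;  /* v indexes the Script_Extensions */
    default:
        assert(ext != NULL && v < ext_len);
        return ext[v];                 /* Script comes from the extra array */
    }
}
```

The point of the selector is that the common case (selector 0) needs no second lookup at all, and Common/Inherited code points with extensions still resolve their Script without touching the array.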
In a regex, what you need is \p{scx=Arab}, corresponding to ICU’s uscript_hasScript(codePoint, script) / UScript.hasScript(codePoint, script) API. “Do the Script_Extensions of code point c contain script sc?”
You could store a UnicodeSet for \p{scx=sc} for each Script sc, but the data would be very redundant with the Script data. (Especially since ES regex has both the Script and Script_Extensions properties.)
ICU enumerates the trie at runtime (with some optimization) to find the per-script set from the combined Script+Script_Extensions trie+array data.
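As a rough illustration of deriving a per-script set from a membership test, the following C sketch scans code points and collects the maximal ranges for which a predicate holds. It is a brute-force scan, not ICU's optimized trie enumeration; the Range type, the HasScriptFn callback (a stand-in for uscript_hasScript), and the toy predicate are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t start, end; } Range;  /* inclusive code point range */

/* Hypothetical predicate type standing in for uscript_hasScript(). */
typedef bool (*HasScriptFn)(uint32_t code_point, int script);

/*
 * Scans [0, limit) and records the maximal ranges where has_script() is true.
 * Returns the number of ranges found; only the first `capacity` are stored.
 */
static size_t collect_script_ranges(HasScriptFn has_script, int script,
                                    uint32_t limit, Range *out, size_t capacity) {
    assert(has_script != NULL);
    size_t count = 0;
    bool in_range = false;
    uint32_t start = 0;
    for (uint32_t c = 0; c < limit; ++c) {
        bool match = has_script(c, script);
        if (match && !in_range) {          /* a new range opens at c */
            start = c;
            in_range = true;
        } else if (!match && in_range) {   /* the open range ended at c - 1 */
            if (count < capacity) { out[count].start = start; out[count].end = c - 1; }
            ++count;
            in_range = false;
        }
    }
    if (in_range) {                        /* close a range that runs to the end */
        if (count < capacity) { out[count].start = start; out[count].end = limit - 1; }
        ++count;
    }
    return count;
}

/* Toy data for demonstration only; not real Unicode property values. */
static bool toy_has_script(uint32_t c, int script) {
    return script == 2 && ((c >= 0x10 && c <= 0x1f) || c == 0x30);
}
```

Calling collect_script_ranges(toy_has_script, 2, 0x40, ranges, 4) yields two ranges, 0x10..0x1f and 0x30..0x30. An exporter could emit such ranges per script instead of storing a redundant set for every script.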
Iain raised in the PR review thread that icuwriteuprops does not currently support Script_Extensions. We will likely need to add another code path for Script_Extensions that makes use of the specialized uscript_getScriptExtensions function.
@Markus Scherer -- what is a good data model to use to export Script_Extensions?
Documenting discussion with @Markus Scherer from email:
I asked:
I would like to extract the CodePointTrie buffers for enumerated Unicode properties from ICU4C and ship them in their own data file.
I envision this being a standalone binary tool, perhaps similar to icupkg, that, when invoked, loads ICU4C and dumps the needed buffers into a JSON file. Running the tool should be part of CI, Bazel, and/or the BRS release tasks.
My question is, where do you envision the best place for such a tool to plug into ICU4C to get all the information it needs?
Markus replied:
One version is to read the relevant .icu binary files directly and use their data structures, where one trie typically encodes multiple properties.
Pro: You can just grab ICU's data files.
Con: You have to closely follow along as the ICU data file formats occasionally change.
This might make sense for the various case mapping properties:
https://unicode-org.github.io/icu/userguide/icudata.html#unicode-character-data-case-mappings-for-java-only-hardcoded-in-c-common-library
See for example
https://github.com/unicode-org/icu/blob/master/icu4j/main/classes/core/src/com/ibm/icu/impl/UCaseProps.java#L1501
This might not make sense for the hodge-podge of properties in
https://unicode-org.github.io/icu/userguide/icudata.html#unicode-character-data-properties-for-java-only-hardcoded-in-c-common-library
So another version is to write a tool that calls public ICU API to enumerate property values, such as
U_CAPI const USet * U_EXPORT2
u_getBinaryPropertySet(UProperty property, UErrorCode *pErrorCode);
and
U_CAPI const UCPMap * U_EXPORT2
u_getIntPropertyMap(UProperty property, UErrorCode *pErrorCode);
And build a new UMutableCPTrie-->UCPTrie, and then serialize that as you see fit.
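One possible reading of "serialize that as you see fit", given the JSON output mentioned in the question, is to write each trie array as a JSON array of numbers. The C sketch below does that for an array of 16-bit values. The function name is hypothetical, the input is a stand-in for whatever index or data array the tool extracts, and this is not a format that ICU itself defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Writes `values` as a JSON array such as [1,2,3] into `buf`.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if `buf` is too small.
 */
static int write_json_u16_array(const uint16_t *values, size_t count,
                                char *buf, size_t capacity) {
    assert(buf != NULL && (values != NULL || count == 0));
    size_t pos = 0;
    if (capacity < 3) return -1;            /* room for "[]" plus NUL */
    buf[pos++] = '[';
    for (size_t i = 0; i < count; ++i) {
        /* Every element after the first is preceded by a comma. */
        int n = snprintf(buf + pos, capacity - pos, "%s%u",
                         i == 0 ? "" : ",", (unsigned)values[i]);
        if (n < 0 || (size_t)n >= capacity - pos) return -1;  /* truncated */
        pos += (size_t)n;
    }
    if (capacity - pos < 2) return -1;      /* need "]" and NUL */
    buf[pos++] = ']';
    buf[pos] = '\0';
    return (int)pos;
}
```

For the values {0, 64, 65535} this produces the text [0,64,65535]. A real exporter would also need to record the trie type, value width, and similar header fields alongside the arrays so that a consumer can rebuild the lookup.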
As a tool in the regular build process, it would want to go into
https://github.com/unicode-org/icu/tree/master/icu4c/source/tools
ICU4X needs code point trie data (both for tests and for properties) exported to an easily consumable format for downstream use.
See:
https://github.com/unicode-org/icu4x/issues/509