Export UCPTrie data to array buffers

Script_Extensions maps a code point to a set of Script values. Storage-wise, you want to optimize for the case where that set contains a single Script value, usually the same as the Script property value, because more than 99% of code points have Script_Extensions(c) = {Script(c)}.

In ICU data, from the CodePointTrie result I get a 12-bit value for both Script and Script_Extensions. The top 2 bits tell me whether it’s the simple case; if not, the other 10 bits point into an additional array with the Script_Extensions. As a further optimization for the Script value lookup, the top 2 bits also tell me whether the Script is Common or Inherited – if not, then the main Script also comes from the extra array. See https://github.com/unicode-org/icu/blob/main/icu4c/source/common/uprops.h

 * Properties in vector word 0
 * Bits
 * 23..22   3..1: Bits 21..20 & 7..0 = Script_Extensions index
 *             3: Script value from Script_Extensions
 *             2: Script=Inherited
 *             1: Script=Common
 *             0: Script=bits 21..20 & 7..0
 * 21..20   Bits 9..8 of the UScriptCode, or index to Script_Extensions
 *  7.. 0   UScriptCode, or index to Script_Extensions
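
To make the layout above concrete, here is a minimal sketch of pulling the 2-bit selector and the 10-bit value out of vector word 0. The enum, struct, and function names are made up for illustration and are not ICU's own macros; only the bit positions come from the comment quoted above.

```c
#include <stdint.h>

/* Illustrative names only; the bit positions follow the uprops.h comment. */
enum ScxKind {
    SCX_KIND_SCRIPT_ONLY = 0,    /* 10-bit value is the Script code itself */
    SCX_KIND_WITH_COMMON = 1,    /* Script=Common, value is a Script_Extensions index */
    SCX_KIND_WITH_INHERITED = 2, /* Script=Inherited, value is a Script_Extensions index */
    SCX_KIND_WITH_OTHER = 3      /* Script value also comes from Script_Extensions */
};

struct ScxField {
    enum ScxKind kind;      /* bits 23..22 */
    uint32_t code_or_index; /* bits 21..20 (high part) and 7..0 (low part) */
};

/* Split the relevant bits of vector word 0 into selector and 10-bit value. */
static struct ScxField decode_scx_field(uint32_t word0) {
    struct ScxField f;
    f.kind = (enum ScxKind)((word0 >> 22) & 0x3u);
    /* Move bits 21..20 down to bits 9..8, then add the low byte. */
    f.code_or_index = ((word0 >> 12) & 0x300u) | (word0 & 0xFFu);
    return f;
}
```

With kind 0 the value is used directly as the Script code; with kinds 1 to 3 it is an index into the additional array, whose exact layout this sketch does not model.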

In a regex, what you need is \p{scx=Arab}, corresponding to ICU’s uscript_hasScript(codePoint, script) / UScript.hasScript(codePoint, script) API: “Do the Script_Extensions of code point c contain script sc?”

You could store a UnicodeSet for \p{scx=sc} for each Script sc, but the data would be very redundant with the Script data. (Especially since ES regex has both the Script and Script_Extensions properties.)

ICU enumerates the trie at runtime (with some optimization) to find the per-script set from the combined Script+Script_Extensions trie+array data.
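
As a rough illustration of deriving a per-script set from combined Script+Script_Extensions data, the sketch below walks a sorted list of ranges and collects those whose Script_Extensions contain a given script. Everything here (struct ScriptData, struct Range, has_script, collect_scx_ranges) is a hypothetical in-memory model, not ICU's trie or array layout, and it assumes the ranges are sorted and non-overlapping.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one Script value plus an optional explicit set. */
struct ScriptData {
    int script;     /* Script property value */
    const int *scx; /* explicit Script_Extensions, or NULL meaning {script} */
    size_t scx_len;
};

/* "Do the Script_Extensions of this entry contain script sc?" */
static bool has_script(const struct ScriptData *d, int sc) {
    if (d->scx == NULL) {
        return d->script == sc; /* simple case: scx = {Script} */
    }
    for (size_t i = 0; i < d->scx_len; ++i) {
        if (d->scx[i] == sc) {
            return true;
        }
    }
    return false;
}

/* A run of code points sharing the same data (assumed sorted, non-overlapping). */
struct Range {
    unsigned start;
    unsigned end; /* inclusive */
    struct ScriptData data;
};

/* Collect [start, end] pairs for \p{scx=sc}, merging adjacent matches.
 * Returns the number of pairs written; matches beyond out_cap are dropped. */
static size_t collect_scx_ranges(const struct Range *ranges, size_t n, int sc,
                                 unsigned (*out)[2], size_t out_cap) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (!has_script(&ranges[i].data, sc)) {
            continue;
        }
        if (count > 0 && out[count - 1][1] + 1 == ranges[i].start) {
            out[count - 1][1] = ranges[i].end; /* extend the previous pair */
        } else if (count < out_cap) {
            out[count][0] = ranges[i].start;
            out[count][1] = ranges[i].end;
            ++count;
        }
    }
    return count;
}
```

The point of the model is that the per-script set never has to be stored: it can be recomputed from the shared data whenever a \p{scx=sc} set is needed.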

Shane Carr

August 31, 2021 at 12:27 AM

Iain raised in the PR review thread that icuwriteuprops does not currently support Script_Extensions. We will likely need to add another code path for Script_Extensions that makes use of the specialized uscript_getScriptExtensions function.

@Markus Scherer -- what is a good data model to use to export Script_Extensions?

Shane Carr

April 23, 2021 at 11:55 PM

Documenting discussion with @Markus Scherer from email:

I asked:

I would like to extract the CodePointTrie buffers for enumerated Unicode properties from ICU4C and ship them in their own data file.
I envision this being a standalone binary tool, perhaps similar to icupkg, that, when invoked, loads ICU4C and dumps the needed buffers into a JSON file. Running the tool should be part of CI, Bazel, and/or the BRS release tasks.
My question is: where do you see the best place for such a tool to plug into ICU4C to get all the information it needs?

Markus replied:

One version is to read the relevant .icu binary files directly and use their data structures, where one trie typically encodes multiple properties.
Pro: You can just grab ICU's data files.
Con: You have to closely follow along as the ICU data file formats occasionally change.
This might make sense for the various case mapping properties:
https://unicode-org.github.io/icu/userguide/icudata.html#unicode-character-data-case-mappings-for-java-only-hardcoded-in-c-common-library
See for example
https://github.com/unicode-org/icu/blob/master/icu4j/main/classes/core/src/com/ibm/icu/impl/UCaseProps.java#L1501
This might not make sense for the hodge-podge of properties in
https://unicode-org.github.io/icu/userguide/icudata.html#unicode-character-data-properties-for-java-only-hardcoded-in-c-common-library

So another version is to write a tool that calls public ICU API to enumerate property values, such as
U_CAPI const USet * U_EXPORT2
u_getBinaryPropertySet(UProperty property, UErrorCode *pErrorCode);
and
U_CAPI const UCPMap * U_EXPORT2
u_getIntPropertyMap(UProperty property, UErrorCode *pErrorCode);
And build a new UMutableCPTrie-->UCPTrie, and then serialize that as you see fit.
As a tool in the regular build process, it would want to go into
https://github.com/unicode-org/icu/tree/master/icu4c/source/tools
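
For the "serialize that as you see fit" step, one possible shape for a JSON dump of a 16-bit buffer is sketched below. The helper write_json_u16_array and the output format are assumptions made for illustration; they are not part of ICU and do not describe the format the eventual tool uses. The caller supplies whatever buffer it has extracted.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: writes {"<name>":[v0,v1,...]} into out.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if the output does not fit. The name is assumed to need
 * no JSON escaping. */
static int write_json_u16_array(const char *name, const uint16_t *values,
                                size_t count, char *out, size_t cap) {
    size_t used = 0;
    int n = snprintf(out, cap, "{\"%s\":[", name);
    if (n < 0 || (size_t)n >= cap) {
        return -1;
    }
    used = (size_t)n;
    for (size_t i = 0; i < count; ++i) {
        /* Comma-separate every element after the first. */
        n = snprintf(out + used, cap - used, "%s%u",
                     i == 0 ? "" : ",", (unsigned)values[i]);
        if (n < 0 || (size_t)n >= cap - used) {
            return -1;
        }
        used += (size_t)n;
    }
    n = snprintf(out + used, cap - used, "]}");
    if (n < 0 || (size_t)n >= cap - used) {
        return -1;
    }
    used += (size_t)n;
    return (int)used;
}
```

A real tool would more likely stream to a file than to a fixed buffer; the fixed buffer just keeps the sketch short.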

Fixed

Details

Assignee: Shane Carr
Reporter: Elango Cheran
Components: properties
Labels: icu4x
Priority: major
Time Needed: Hours
Fix versions: 70.1

Created March 18, 2021 at 11:02 PM

Updated November 30, 2021 at 10:00 PM

Resolved September 28, 2021 at 12:51 AM