sets & maps for Unicode properties

Description

For binary (true/false) properties, users can get a UnicodeSet that represents the whole property across all of Unicode. They can use a UnicodeSet pattern or UnicodeSet.applyIntPropertyValue(property, 0|1). It would sometimes be convenient to have an API that returns an immutable (frozen) set for the property.

For enumerated properties, where we map each code point to the int value of a value enum constant, we don’t have anything like this. You can only ask getIntPropertyValue(code point, property) to get the value for each code point. There is no efficient way to get the values for all code points.
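
To make the cost concrete, here is a minimal C sketch of the only approach available today: call the getter once per code point and detect where the value changes. The function toy_get_int_property_value is a hypothetical stand-in for u_getIntPropertyValue(c, property), and its values are toy data, not real Unicode property data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for u_getIntPropertyValue(c, property).
 * Toy values only, NOT real Unicode data. */
static int32_t toy_get_int_property_value(int32_t c) {
    if (c >= 0x41 && c <= 0x5A) return 1;   /* toy value for A-Z */
    if (c >= 0x61 && c <= 0x7A) return 2;   /* toy value for a-z */
    return 0;                               /* toy default value */
}

/* Scans every code point and counts the runs of equal values.
 * The cost is one getter call per code point, however few runs exist. */
static int32_t toy_count_ranges(int32_t *pCalls) {
    int32_t ranges = 0;
    int32_t calls = 0;
    int32_t prev = -1;  /* no toy value is negative, so U+0000 starts a run */
    assert(pCalls != NULL);
    for (int32_t c = 0; c <= 0x10FFFF; ++c) {
        int32_t v = toy_get_int_property_value(c);
        ++calls;
        if (v != prev) {
            ++ranges;
            prev = v;
        }
    }
    *pCalls = calls;
    return ranges;
}
```

With this toy data the scan finds only 5 ranges but still needs 0x110000 (1,114,112) getter calls, which is the overhead a whole-property map would avoid.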

ICU 63 adds a Java base class CodePointMap with an immutable subclass CodePointTrie (and nested subclasses for subtypes thereof).

I would like to add API like getIntPropertyMap(property) which returns a whole CodePointMap. I would not promise a particular implementation type, but the Map has a per-code point value getter and range iteration functions. Users could easily and efficiently use this to build other, custom maps.

What about C? (Properties APIs are usually C, not C++, mostly to avoid duplication.) In C, I added UCPTrie which is equivalent to Java CodePointTrie, but since there is no polymorphism in C, there is no equivalent to CodePointMap.

We could add u_getIntPropertyMap(property) returning a const UCPMap, and cache it as a singleton (one per property). UCPMap would be an opaque type with a few functions like ucpmap_get(map, c) and ucpmap_getRange(map, start, ...) parallel with UCPTrie. There would of course not be any polymorphic relationship. Internally it may or may not wrap a UCPTrie in some way.
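
As a rough illustration of the shape such an API could take, here is a self-contained C sketch. Every toy_ name, the simplified getRange signature, and the sorted-runs representation are assumptions made for this example; they are not the proposed ICU API or its implementation, and the data is not real Unicode property data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All toy_/Toy names are hypothetical; this is not the ICU API. */
typedef struct ToyCPMap ToyCPMap;  /* would be opaque in a public header */

/* One run of code points sharing a value: [start, next run's start - 1]. */
typedef struct { int32_t start; uint32_t value; } ToyRun;

struct ToyCPMap {
    const ToyRun *runs;  /* sorted by start; runs[0].start == 0 */
    int32_t length;
};

/* Toy data, NOT real Unicode property values. */
static const ToyRun TOY_RUNS[] = {
    { 0x0000, 0 }, { 0x0041, 1 }, { 0x005B, 0 }, { 0x0061, 2 }, { 0x007B, 0 }
};
static const ToyCPMap TOY_MAP = { TOY_RUNS, 5 };

/* Singleton accessor, mirroring the "one cached map per property" idea. */
static const ToyCPMap *toy_get_map(void) { return &TOY_MAP; }

/* Index of the run containing c, found by binary search. */
static int32_t toy_find_run(const ToyCPMap *map, int32_t c) {
    int32_t lo = 0, hi = map->length - 1;
    while (lo < hi) {
        int32_t mid = lo + (hi - lo + 1) / 2;
        if (map->runs[mid].start <= c) lo = mid; else hi = mid - 1;
    }
    return lo;
}

/* Per-code point getter. */
static uint32_t toy_cpmap_get(const ToyCPMap *map, int32_t c) {
    assert(map != NULL && 0 <= c && c <= 0x10FFFF);
    return map->runs[toy_find_run(map, c)].value;
}

/* Returns the end of the range that contains start and stores its value,
 * or -1 if start is not a valid code point. */
static int32_t toy_cpmap_get_range(const ToyCPMap *map, int32_t start,
                                   uint32_t *pValue) {
    int32_t i;
    assert(map != NULL);
    if (start < 0 || start > 0x10FFFF) return -1;
    i = toy_find_run(map, start);
    if (pValue != NULL) *pValue = map->runs[i].value;
    return (i + 1 < map->length) ? map->runs[i + 1].start - 1 : 0x10FFFF;
}

/* Client code: fills a custom flat table for U+0000..U+007F by walking
 * ranges instead of calling the getter once per code point.
 * Returns the number of ranges visited. */
static int32_t toy_fill_ascii_table(const ToyCPMap *map, uint8_t table[128]) {
    int32_t ranges = 0;
    int32_t start = 0, end;
    uint32_t value = 0;
    while (start < 128 &&
           (end = toy_cpmap_get_range(map, start, &value)) >= 0) {
        for (int32_t c = start; c <= end && c < 128; ++c) {
            table[c] = (uint8_t)value;
        }
        ++ranges;
        start = end + 1;
    }
    return ranges;
}
```

A caller would obtain the map once with toy_get_map() and then either look up single code points or iterate ranges, as toy_fill_ascii_table does. Callers see only the functions, so the internals could wrap a trie or anything else without affecting them.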

If we do it quickly, I could rename the new enum UCPTrieRangeOption to UCPMapRangeOption, and the new UCPTrieValueFilter to UCPMapValueFilter, to avoid confusion and duplication.

We should probably also add umutablecptrie_fromUCPMap(map).

Assignee

Markus Scherer

Reporter

Markus Scherer

Priority

medium