Make Identifier_Status & Identifier_Type be regular ICU properties

Description

Make the Idmod_Status and Idmod_Type from https://www.unicode.org/reports/tr39/#General_Security_Profile regular properties, so that people can use them more effectively, and remove the @internal hacks:

public static final UnicodeSet INCLUSION
public static final UnicodeSet RECOMMENDED

Original ticket title: Make Idmod_Status and Idmod_Type be regular ICU properties

These were later formalized with property names (Identifier_Status & Identifier_Type) and value names. See https://www.unicode.org/reports/tr39/#Identifier_Status_and_Type

Note that a code point’s Identifier_Type value is a set of the values listed for that property. We probably need to treat it like Script_Extensions, or like the ICU-specific General_Category_Mask, and we might need an additional u_hasIdentifierType(c, type) function. In the data, Not_Character probably wants to be 0. Inclusion and Recommended are single values/bit combinations. Other values feed into Identifier_Status=Restricted.
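A minimal sketch of the set-valued model described above, assuming a naive layout with one bit per Identifier_Type value and Not_Character stored as 0. The class, enum, and method names are hypothetical and are not ICU API. A real implementation would first look up the bits for a code point; that lookup is left out here.

```java
import java.util.EnumSet;

/** Hypothetical sketch of a bit-set data model for Identifier_Type; not ICU API. */
final class IdentifierTypeBits {
    /** Identifier_Type value names from UTS #39. */
    enum IdType {
        NOT_CHARACTER, DEPRECATED, DEFAULT_IGNORABLE, NOT_NFKC, NOT_XID,
        EXCLUSION, OBSOLETE, TECHNICAL, UNCOMMON_USE, LIMITED_USE,
        INCLUSION, RECOMMENDED
    }

    /** Assumption: Not_Character is stored as 0; every other type gets its own bit. */
    static int bit(IdType type) {
        return type == IdType.NOT_CHARACTER ? 0 : 1 << (type.ordinal() - 1);
    }

    /** Combines a set of types into the per-code-point lookup value. */
    static int encode(EnumSet<IdType> types) {
        int bits = 0;
        for (IdType type : types) {
            bits |= bit(type);
        }
        return bits;
    }

    /** Analogue of the proposed u_hasIdentifierType(c, type), applied to the stored bits. */
    static boolean hasIdentifierType(int bits, IdType type) {
        return type == IdType.NOT_CHARACTER ? bits == 0 : (bits & bit(type)) != 0;
    }

    /** Identifier_Status=Allowed only for exactly Inclusion or exactly Recommended. */
    static boolean isAllowed(int bits) {
        return bits == bit(IdType.INCLUSION) || bits == bit(IdType.RECOMMENDED);
    }
}
```

For example, encoding the combination Limited_Use plus Uncommon_Use yields a value for which `hasIdentifierType` is true for both types and `isAllowed` is false, so the code point falls under Identifier_Status=Restricted.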

Activity

Mark Davis 
March 12, 2024 at 12:39 AM

Do you have a concrete use case for the Identifier_Type, or do you mostly care about the Identifier_Status?

People do use Identifier_Status, but I’d have to track them down. Of course, 5 of the enum values can be computed from other properties, but others can’t.

Until such time as we have script metaproperties, the only way to find out which characters are Excluded or Limited Use is this property. So if an implementer wants to have a custom version of Allowed that permits a particular Limited_Use script, then they can use the Identifier_Type to do that: add characters from the script that are not excluded by another Restricted category.
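A rough sketch of that customization, assuming the per-character script and Identifier_Type data are already available in some map. The class names, the `customAllowed` helper, and the map-based data source are hypothetical and are not ICU API. A character from the chosen script is added only when Limited_Use is its sole type, that is, when no other Restricted category excludes it.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: a custom Allowed set that also permits one Limited_Use script. */
final class CustomAllowedSketch {
    enum IdType {
        NOT_CHARACTER, DEPRECATED, DEFAULT_IGNORABLE, NOT_NFKC, NOT_XID,
        EXCLUSION, OBSOLETE, TECHNICAL, UNCOMMON_USE, LIMITED_USE,
        INCLUSION, RECOMMENDED
    }

    /** Toy per-character record: script code plus Identifier_Type set. */
    static final class CharInfo {
        final String script;
        final EnumSet<IdType> types;

        CharInfo(String script, EnumSet<IdType> types) {
            this.script = script;
            this.types = types;
        }
    }

    /** Returns the standard Allowed characters plus the Limited_Use-only characters of one script. */
    static Set<Integer> customAllowed(Map<Integer, CharInfo> data, String limitedUseScript) {
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<Integer, CharInfo> entry : data.entrySet()) {
            CharInfo info = entry.getValue();
            boolean allowed = info.types.contains(IdType.RECOMMENDED)
                    || info.types.contains(IdType.INCLUSION);
            // Limited_Use must be the only type: any other Restricted type still excludes the character.
            boolean onlyLimitedUse = info.types.equals(EnumSet.of(IdType.LIMITED_USE));
            if (allowed || (onlyLimitedUse && info.script.equals(limitedUseScript))) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

With illustrative data in which U+0061 is Recommended, U+13A0 (script `Cher`) is Limited_Use only, and U+A9CF (script `Java`) is Limited_Use plus Uncommon_Use, calling `customAllowed(data, "Cher")` would admit U+0061 and U+13A0 but not U+A9CF.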

Similarly, if someone had a more or less restrictive definition of identifiers than XID, then that could be handled in the same way.

I haven’t seen any case where it is useful to know the difference between Obsolete, Technical, and Uncommon_Use; the distinctions are more useful for maintainers. So if we were going to do anything, merging those groups externally would be reasonable.

And why did we make the Identifier_Status an enumerated property, rather than a binary Identifier_Allowed?

Just history; I think early on there was some question about whether to add gradations, but we ended up just being binary.

Markus Scherer 
March 11, 2024 at 9:57 PM

FYI starting work on an implementation.

Also FYI PAG issue 217 “Strange Identifier_Type combinations” includes a recommendation to “Change the Identifier_Type of A9CF to Limited_Use Uncommon_Use, removing Exclusion” in Unicode 16 because Exclusion and Limited_Use are exclusive with each other.

Markus Scherer 
December 27, 2023 at 6:49 PM

I am trying to figure out the data model for the properties. There seem to be enough constraints for encoding them in 6-8 bits of per-character lookup value. However, I am getting push-back from Asmus Freytag and Ken Whistler about the design and usefulness of the Identifier_Type.

Do you have a concrete use case for the Identifier_Type, or do you mostly care about the Identifier_Status?

And why did we make the Identifier_Status an enumerated property, rather than a binary Identifier_Allowed?

UnicodeBot 
June 30, 2018 at 11:40 PM

Trac Comment 3 by —2015-02-05T21:47:45.981Z

Mark just proposed to the UTC to rename and split some of the Type values; wait until the renaming is settled.

See discussion "questions on Make Idmod_Status and Idmod_Type be regular ICU properties" on the ICU team mailing list about what properties and values and names we want exactly.

UnicodeBot 
June 30, 2018 at 11:40 PM

Trac Comment 2 by —2015-02-05T20:38:04.924Z

3.1 General Security Profile for Identifiers

The file idmod provides data for a profile of identifiers in environments where security is at issue. The file contains a set of characters recommended to be restricted from use. It also contains a small set of characters that are recommended as additions to the list of characters defined by the XID_Start and XID_Continue properties, because they may be used in identifiers in a broader context than programming identifiers. ...

In the file `[idmod]`, Field 1 is the character in question, Field 2 is a Status (either restricted or allowed), and Field 3 is a Type. The Types are subcategories of the Status value, and are listed in Table 1, [Identifier Modification Key|http://unicode.org/reports/tr39/#Identifier_Modification_Key].
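As an illustration of the three-field layout described in that excerpt, here is a small parsing sketch. It assumes the usual UCD-style conventions (semicolon-separated fields, `#` comments, optional `XXXX..YYYY` ranges in Field 1), which the excerpt itself does not spell out. The class and method names are hypothetical.

```java
/** Hypothetical sketch: parse one line of an idmod-style data file. */
final class IdmodLineSketch {
    /** One parsed entry: a code point range plus its Status and Type strings. */
    static final class Entry {
        final int start;
        final int end;
        final String status;
        final String type;

        Entry(int start, int end, String status, String type) {
            this.start = start;
            this.end = end;
            this.status = status;
            this.type = type;
        }
    }

    /** Returns the parsed entry, or null for blank and comment-only lines. */
    static Entry parse(String line) {
        int hash = line.indexOf('#');
        if (hash >= 0) {
            line = line.substring(0, hash); // strip the trailing comment
        }
        line = line.trim();
        if (line.isEmpty()) {
            return null;
        }
        String[] fields = line.split(";");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected 3 fields: " + line);
        }
        // Field 1: a single code point or a range, in hexadecimal.
        String range = fields[0].trim();
        int dots = range.indexOf("..");
        int start = Integer.parseInt(dots < 0 ? range : range.substring(0, dots), 16);
        int end = dots < 0 ? start : Integer.parseInt(range.substring(dots + 2), 16);
        // Field 2: Status; Field 3: Type.
        return new Entry(start, end, fields[1].trim(), fields[2].trim());
    }
}
```

A made-up line such as `0027 ; allowed ; inclusion # APOSTROPHE` would parse to the single code point U+0027 with Status `allowed` and Type `inclusion`.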

Fixed

Created June 28, 2018 at 5:23 PM
Updated March 20, 2024 at 8:20 PM
Resolved March 20, 2024 at 8:20 PM
