Issues with Definitions of Unicode Sets in LDML Specification Version 35

Description

There are a number of problems with the syntax of Unicode Sets given in Section 5.3.3. I list them below under 7 headings.

1. Symbol: char; nature: obscurity.
There are references to EBNF, but the notation is not defined. It is not immediately clear that '\\' means a single backslash.

2. Symbol: root; nature: garden path for parsers.
[:gc=Lo:] is ambiguous. It could be a set defined by a property, or it could be equivalent to [cgoL=:]. Even with a preference rule, we would still have garden paths, and what if the property gc were not supported?

While we should allow the 3-character sequence '[:]', I suggest that '[:' be otherwise banned except as an expansion of 'prop'. Thus references to unsupported properties will automatically become errors, rather than being misinterpreted.
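The two readings, and the rule suggested above, can be sketched as follows. This is only an illustration: the function names literal_reading and classify_opening are hypothetical, and the code is not taken from any existing UnicodeSet implementation.

```python
def literal_reading(pattern: str) -> set[str]:
    """Read the bracket contents as plain characters, ignoring property syntax."""
    if not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError("expected a bracketed set")
    return set(pattern[1:-1])


def classify_opening(pattern: str) -> str:
    """Suggested rule (assumption): '[:' starts a 'prop' unless the token is '[:]'."""
    if pattern.startswith("[:]"):
        return "set"    # the one permitted set that begins with '[:'
    if pattern.startswith("[:"):
        return "prop"   # committed to a property; an unknown name is an error
    if pattern.startswith("["):
        return "set"
    return "other"


# The literal reading of [:gc=Lo:] is the same set of characters as [cgoL=:].
print(literal_reading("[:gc=Lo:]") == set("cgoL=:"))   # True
print(classify_opening("[:gc=Lo:]"))                   # prop
```

Under such a rule the parser commits to 'prop' after two characters, so an unsupported property name can be reported as an error instead of silently falling back to the literal reading.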

3. Symbol: quoted; nature: garden path for parsers.
[\u0040] is formally ambiguous. It can mean [04u] as well as [@].

Is '[\u040]' really meant to mean the same as '[04u]'?
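A rough sketch of the two readings, for the body of the set only. The helper names escape_first and quoted_only are hypothetical, and only the \uhhhh form is modelled; the real grammar has more escape forms.

```python
import re

# Either a 4-digit \u escape, or a backslash quoting the next character.
_TOKEN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\(.)", re.DOTALL)


def escape_first(body: str) -> set[str]:
    """Prefer the \\uhhhh escape; fall back to 'quoted' when it does not fit."""
    def repl(m: re.Match) -> str:
        if m.group(1) is not None:
            return chr(int(m.group(1), 16))
        return m.group(2)
    return set(_TOKEN.sub(repl, body))


def quoted_only(body: str) -> set[str]:
    """Treat every backslash as quoting the single character after it."""
    return set(re.sub(r"\\(.)", lambda m: m.group(1), body, flags=re.DOTALL))


print(escape_first(r"\u0040") == {"\u0040"})   # True: one code point, U+0040
print(quoted_only(r"\u0040") == set("04u"))    # True: the characters u, 0, 4
print(escape_first(r"\u040") == set("04u"))    # True: too few digits, so 'quoted'
```

The last line shows the garden path: with one digit fewer, even a parser that prefers escapes ends up with the 'quoted' reading.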

I suggest the expansion “[\u0000-\U00010FFFF]” be expanded in length and reduced in scope to “[[\u0000-\U00010FFFF]-[uUxN]]”, so that a backslash followed by u, U, x or N can never be read as a quoted character.

\p and \P do not need to be excluded as expansions of char, for:
(1) '\p{' and '\P{' can only occur in the start of an expression of 'prop'.
(2) Unicode sets occur in strings rather than streams, so lookahead can be straightforward in hand-crafted parsers.

4. Symbols: quoted, propName; nature: Error?
\N notation is not supported for U+0F60 TIBETAN LETTER -A. (Note that there is U+0F68 TIBETAN LETTER A.) Should hyphen-minus be allowed in propName?

5. Symbols: range, char; nature: clarity.
The following meet the syntax of 'range':
A-\x{41 42}
\x{41 42}-D

Should they be interpreted respectively as
A-AB
AB-D
?
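For reference, a minimal sketch of the naive reading, in which the braced escape is simply replaced by the code points it lists. The helper name expand_braced_hex is hypothetical, and whether the specification intends this textual substitution is exactly the open question.

```python
def expand_braced_hex(escape: str) -> str:
    """Expand a \\x{h...h h...h} escape into the string of code points it lists."""
    if not (escape.startswith("\\x{") and escape.endswith("}")):
        raise ValueError("expected an escape of the form \\x{...}")
    return "".join(chr(int(h, 16)) for h in escape[3:-1].split())


def substitute(range_text: str, escape: str) -> str:
    """Replace the escape inside a range expression by its expansion."""
    return range_text.replace(escape, expand_braced_hex(escape))


print(expand_braced_hex(r"\x{41 42}"))              # AB
print(substitute(r"A-\x{41 42}", r"\x{41 42}"))     # A-AB
print(substitute(r"\x{41 42}-D", r"\x{41 42}"))     # AB-D
```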

6. Symbols: range, char; nature: clarity.
In the light of the above, is
A-\x{41 42}-D
indeed a syntax error?

7. Section 5.3.3.1 "List of Code Points"; nature: contradiction.
For '\xhh', this table says '1-2 hex digits', but the previous table specifies exactly 2 digits. A preference rule would be needed to make a single hex digit unambiguous.
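The ambiguity under the '1-2 hex digits' wording can be made concrete with a small sketch. The function name readings is hypothetical; it enumerates every way the start of a string can be read as a \x escape with one or with two hex digits.

```python
import re


def readings(text: str) -> list[tuple[str, str]]:
    """Return (character, remainder) for each admissible \\xh or \\xhh reading."""
    results = []
    for digits in (1, 2):
        m = re.match(r"\\x([0-9A-Fa-f]{%d})" % digits, text)
        if m:
            results.append((chr(int(m.group(1), 16)), text[m.end():]))
    return results


# Two readings: U+0004 followed by '1', or U+0041 with nothing left over.
print(readings(r"\x41"))   # [('\x04', '1'), ('A', '')]
# Only one reading when a single hex digit is available.
print(readings(r"\x4"))    # [('\x04', '')]
```

A longest-match preference would select the two-digit reading whenever two hex digits follow, which is the kind of preference rule the text above says would be needed.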

xpath

None

locale

None

Status

Assignee

Mark Davis

Reporter

Richard Wordingham

Labels

None

tracReporter

None

tracOwner

None

tracResolution

None

tracStatus

None

Reviewer

None

phase

spec-beta

tracCc

None

tracCreated

None

Priority

TBD