ICU needs to be lenient when parsing U+00A0 vs. U+0020 in numbers

Description

Deleted Component: formatting

Spaces in number formats (e.g., percent format for German) should almost always be nonbreaking spaces (U+00A0). CLDR is planning to implement this in 1.6. ICU needs to be lenient when parsing such numbers, accepting either U+00A0 or U+0020. This is just one case of lenient parsing support overall. Probably this should be applied to dates as well.
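
As a rough illustration of the leniency asked for here, the following sketch treats U+0020 and U+00A0 as interchangeable when matching a format's literal suffix against the input. It is not ICU code; the class and method names (LenientSpaceSketch, matchesLeniently, endsWithLeniently) are hypothetical and exist only for this example.

```java
final class LenientSpaceSketch {
    // True for the two spaces this ticket is about: U+0020 and U+00A0.
    static boolean isPlainOrNoBreakSpace(char c) {
        return c == 0x0020 || c == 0x00A0;
    }

    // Hypothetical helper: identical characters match, and the two spaces
    // above match each other in either direction.
    static boolean matchesLeniently(char expected, char actual) {
        if (expected == actual) {
            return true;
        }
        return isPlainOrNoBreakSpace(expected) && isPlainOrNoBreakSpace(actual);
    }

    // Checks whether the input ends with the expected suffix, e.g. the
    // "<space>%" suffix of a percent format, using the lenient comparison.
    static boolean endsWithLeniently(String input, String suffix) {
        int offset = input.length() - suffix.length();
        if (offset < 0) {
            return false;
        }
        for (int i = 0; i < suffix.length(); i++) {
            if (!matchesLeniently(suffix.charAt(i), input.charAt(offset + i))) {
                return false;
            }
        }
        return true;
    }
}
```

With a suffix of U+00A0 followed by "%", this accepts input written with either space before the percent sign, and still rejects input with no space at all.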

Activity

TracBot
July 1, 2018, 12:11 AM
Trac Comment 3 by —2007-12-20T21:52:35.000Z

(copying some history)

Here's how I'm using these in the initial code. Take the locale's decimal and grouping characters as before. Compute UnicodeSets for these based on the following (1) and (2). The code is changed to check for inclusion in decimalSet or groupingSet, instead of just testing against decimal or grouping.

1) UnicodeSet decimalSet = new UnicodeSet(getSimilarDecimals(decimal));
2) UnicodeSet groupingSet = new UnicodeSet(defaultGroupingSeparators).add(grouping).removeAll(decimalSet);

As a result of (1) and (2), decimalSet is guaranteed to contain the decimal, and groupingSet is guaranteed to contain the grouping separator, unless decimal and grouping are the same, which should never happen. In that case, the separator is simply removed from groupingSet.

getSimilarDecimals looks like this:

private UnicodeSet getSimilarDecimals(char decimal) {
    if (dotEquivalents.contains(decimal)) return dotEquivalents;
    if (commaEquivalents.contains(decimal)) return commaEquivalents;
    // if there is no match, return the character itself
    return new UnicodeSet().add(decimal);
}

Thus suppose we start with decimal = comma and grouping = period. Here's what happens.

1) decimalSet becomes comma, Arabic comma, Arabic decimal separator (which looks like a comma), ideographic comma, ...
2) groupingSet becomes period, apostrophe, spaces, ...
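
The computation in (1) and (2) can be sketched with plain java.util sets of code points. This is only an approximation for illustration: the real code uses ICU's UnicodeSet, the three constant sets below are abridged stand-ins for the full lists, and the class name SeparatorSetsSketch and its two helper methods decimalSet and groupingSet are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

final class SeparatorSetsSketch {
    // Abridged stand-ins for the ticket's three character lists (assumption:
    // only a few members are shown here to keep the example short).
    static final Set<Integer> DOT_EQUIVALENTS = Set.of(0x002E, 0x3002, 0xFF0E);
    static final Set<Integer> COMMA_EQUIVALENTS = Set.of(0x002C, 0x060C, 0x066B, 0x3001);
    static final Set<Integer> OTHER_GROUPING = Set.of(0x0020, 0x0027, 0x00A0, 0x066C);

    // Mirrors getSimilarDecimals: the whole equivalence class if the decimal
    // belongs to one, otherwise just the character itself.
    static Set<Integer> getSimilarDecimals(int decimal) {
        if (DOT_EQUIVALENTS.contains(decimal)) return DOT_EQUIVALENTS;
        if (COMMA_EQUIVALENTS.contains(decimal)) return COMMA_EQUIVALENTS;
        return Set.of(decimal);
    }

    // Step (1): a mutable copy of the decimal's equivalence class.
    static Set<Integer> decimalSet(int decimal) {
        return new TreeSet<>(getSimilarDecimals(decimal));
    }

    // Step (2): all default grouping separators, plus the locale's grouping
    // character, minus everything that already counts as a decimal.
    static Set<Integer> groupingSet(int decimal, int grouping) {
        Set<Integer> result = new TreeSet<>(DOT_EQUIVALENTS);
        result.addAll(COMMA_EQUIVALENTS);
        result.addAll(OTHER_GROUPING);
        result.add(grouping);
        result.removeAll(decimalSet(decimal));
        return result;
    }
}
```

Calling decimalSet(0x002C) and groupingSet(0x002C, 0x002E) reproduces the comma/period case: the comma-like characters end up in the decimal set, while the period, apostrophe, and spaces end up in the grouping set.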

=========================

Here are the actual characters (so far)

dotEquivalents: [.\u2024\u3002\uFE12\uFE52\uFF0E\uFF61]

002E # Po FULL STOP
2024 # Po ONE DOT LEADER
3002 # Po IDEOGRAPHIC FULL STOP
FE12 # Po PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
FE52 # Po SMALL FULL STOP
FF0E # Po FULLWIDTH FULL STOP
FF61 # Po HALFWIDTH IDEOGRAPHIC FULL STOP

commaEquivalents: [,\u060C\u066B\u3001\uFE10\uFE11\uFE50\uFE51\uFF0C\uFF64]

002C # Po COMMA
060C # Po ARABIC COMMA
066B # Po ARABIC DECIMAL SEPARATOR
3001 # Po IDEOGRAPHIC COMMA
FE10..FE11 # Po [2] PRESENTATION FORM FOR VERTICAL COMMA..PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
FE50..FE51 # Po [2] SMALL COMMA..SMALL IDEOGRAPHIC COMMA
FF0C # Po FULLWIDTH COMMA
FF64 # Po HALFWIDTH IDEOGRAPHIC COMMA

otherGroupingSeparators: [\u0020\u0027\u00A0\u066C\u2000-\u200A\u2018\u2019\u202F\u205F\u3000\uFF07]

0020 # Zs SPACE
0027 # Po APOSTROPHE
00A0 # Zs NO-BREAK SPACE
066C # Po ARABIC THOUSANDS SEPARATOR
2000..200A # Zs [11] EN QUAD..HAIR SPACE
2018 # Pi LEFT SINGLE QUOTATION MARK
2019 # Pf RIGHT SINGLE QUOTATION MARK
202F # Zs NARROW NO-BREAK SPACE
205F # Zs MEDIUM MATHEMATICAL SPACE
3000 # Zs IDEOGRAPHIC SPACE
FF07 # Po FULLWIDTH APOSTROPHE

defaultGroupingSeparators = dotEquivalents + commaEquivalents + otherGroupingSeparators
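
For reference, the three lists above and their union can be written out as code point ranges. The sketch below uses java.util sets rather than ICU's UnicodeSet, and the names SeparatorDataSketch, fromRanges, and union are hypothetical. The ranges are transcribed from the per-character listings above.

```java
import java.util.Set;
import java.util.TreeSet;

final class SeparatorDataSketch {
    // Expands inclusive {start, end} code point ranges into a sorted set.
    static Set<Integer> fromRanges(int[][] ranges) {
        Set<Integer> set = new TreeSet<>();
        for (int[] range : ranges) {
            for (int cp = range[0]; cp <= range[1]; cp++) {
                set.add(cp);
            }
        }
        return set;
    }

    // Union of three sets, leaving the inputs untouched.
    static Set<Integer> union(Set<Integer> a, Set<Integer> b, Set<Integer> c) {
        Set<Integer> all = new TreeSet<>(a);
        all.addAll(b);
        all.addAll(c);
        return all;
    }

    static final Set<Integer> DOT_EQUIVALENTS = fromRanges(new int[][] {
        {0x002E, 0x002E}, {0x2024, 0x2024}, {0x3002, 0x3002}, {0xFE12, 0xFE12},
        {0xFE52, 0xFE52}, {0xFF0E, 0xFF0E}, {0xFF61, 0xFF61}});

    static final Set<Integer> COMMA_EQUIVALENTS = fromRanges(new int[][] {
        {0x002C, 0x002C}, {0x060C, 0x060C}, {0x066B, 0x066B}, {0x3001, 0x3001},
        {0xFE10, 0xFE11}, {0xFE50, 0xFE51}, {0xFF0C, 0xFF0C}, {0xFF64, 0xFF64}});

    static final Set<Integer> OTHER_GROUPING_SEPARATORS = fromRanges(new int[][] {
        {0x0020, 0x0020}, {0x0027, 0x0027}, {0x00A0, 0x00A0}, {0x066C, 0x066C},
        {0x2000, 0x200A}, {0x2018, 0x2019}, {0x202F, 0x202F}, {0x205F, 0x205F},
        {0x3000, 0x3000}, {0xFF07, 0xFF07}});

    // defaultGroupingSeparators = dotEquivalents + commaEquivalents + otherGroupingSeparators
    static final Set<Integer> DEFAULT_GROUPING_SEPARATORS =
        union(DOT_EQUIVALENTS, COMMA_EQUIVALENTS, OTHER_GROUPING_SEPARATORS);
}
```

The three lists do not overlap, so the union holds 7 + 10 + 21 = 38 code points.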

TracBot
July 1, 2018, 12:11 AM
Trac Comment 4 by —2007-12-20T21:55:29.000Z

I made 2 further changes.

1. Ideographic characters and the Arabic comma are only accepted in non-strict mode.

2. Once I see a grouping or decimal separator, I restrict the sets (respectively) to only the character I saw. So you can't have "12,345 678.9", even if both space and comma are in the grouping separator set.
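
Change 2 can be sketched as a small scanning loop. This is a simplified illustration and not the ICU parser: it handles only ASCII digits, ignores signs, and does not model the strict-mode filtering from change 1. The class name RestrictingParserSketch and the method parsePrefix are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class RestrictingParserSketch {
    // Scans a number from the start of the text and returns the digits that
    // were consumed, with '.' standing in for the decimal separator and
    // grouping separators dropped. Scanning stops at the first character
    // that is not acceptable.
    static String parsePrefix(String text, Set<Integer> decimalSet, Set<Integer> groupingSet) {
        // Work on copies so the caller's sets are not modified.
        Set<Integer> decimals = new HashSet<>(decimalSet);
        Set<Integer> groupings = new HashSet<>(groupingSet);
        StringBuilder digits = new StringBuilder();
        boolean sawDecimal = false;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (cp >= '0' && cp <= '9') {
                digits.appendCodePoint(cp);
            } else if (!sawDecimal && decimals.contains(cp)) {
                // Restrict the decimal set to the character just seen.
                decimals.retainAll(Set.of(cp));
                sawDecimal = true;
                digits.append('.');
            } else if (!sawDecimal && groupings.contains(cp)) {
                // Restrict the grouping set to the character just seen, so a
                // different grouping separator later in the text is rejected.
                groupings.retainAll(Set.of(cp));
            } else {
                break;
            }
            i += Character.charCount(cp);
        }
        return digits.toString();
    }
}
```

With a period as the decimal and both comma and space as grouping separators, "12,345,678.9" and "12 345 678.9" are consumed in full, while "12,345 678.9" stops at the space because the grouping set was narrowed to the comma.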

TracBot
July 1, 2018, 12:11 AM
Trac Comment 6 by —2007-12-21T21:26:08.000Z

Needs separate bug to do C++.

TracBot
July 1, 2018, 12:11 AM
Trac Comment 9 by —2016-10-05T23:16:54.399Z

Milestone 3.9.2 deleted

Fixed

Assignee

Mark Davis

Reporter

TracBot

Components

None

Labels

None

Reviewer

None

Priority

major

Time Needed

Days

Fix versions