ICU doesn't consider these characters as punctuation (punct): $+<=>^`|~

Description

The java.util.regex.Pattern documentation claims that the POSIX character class "Punct" contains the characters:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Empirically, the following characters are not considered in Punct on ICU:

$+<=>^`|~

Grep seems to treat them as part of [:punct:]. So perhaps this is a bug in ICU? We need to figure out whether that is the case and how to fix it.
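
For reference, a minimal sketch for checking the Java side of this comparison. The class name JavaPunctCheck and the helper matchedByDefaultPunct are made up for illustration. It assumes a runtime whose java.util.regex follows the Pattern documentation (for example OpenJDK); a runtime that backs java.util.regex with ICU may return a different result, which is the point of this ticket.

```java
import java.util.regex.Pattern;

public class JavaPunctCheck {
    // The nine characters named in the title of this issue.
    static final String DISPUTED = "$+<=>^`|~";

    // \p{Punct} compiled without any flags, i.e. the POSIX (US-ASCII) class.
    static final Pattern PUNCT = Pattern.compile("\\p{Punct}");

    // Returns the characters of `candidates` that \p{Punct} matches, in order.
    static String matchedByDefaultPunct(String candidates) {
        StringBuilder matched = new StringBuilder();
        for (char c : candidates.toCharArray()) {
            if (PUNCT.matcher(String.valueOf(c)).matches()) {
                matched.append(c);
            }
        }
        return matched.toString();
    }

    public static void main(String[] args) {
        // If the runtime follows the documented list, all nine characters
        // come back; any character missing from the output is one the
        // runtime does not treat as Punct.
        System.out.println(matchedByDefaultPunct(DISPUTED));
    }
}
```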

GoogleIssue: 111497078

Activity

Andy Heninger
January 5, 2019, 12:55 AM

Closed with a combination of fixed, for the docs update, and working as designed, for the behavior.

Shane Carr
February 21, 2019, 2:28 AM

Where is the docs update? I do not see it on this issue.

Shane Carr
February 21, 2019, 6:57 AM

There are no commits on this ticket. Where is the docs update?

Andy Heninger
February 21, 2019, 10:09 PM

The User Guide change is in the section https://sites.google.com/site/icuprojectuserguide/strings/regexp#TOC-Differences-with-Java-Regular-Expressions: the last item in the list of differences with Java Regular Expressions was added.

The property expression \p{punct} differs in what it matches. Java matches any of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. From that list, ICU omits $+<=>^`|~
ICU follows the recommendations from Unicode UTS-18, http://unicode.org/reports/tr18/#Compatibility_Properties. See also https://unicode-org.atlassian.net/browse/ICU-20095
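
To illustrate why exactly these nine characters drop out, here is a small sketch. It assumes the UTS-18 standard recommendation that punct is the Unicode General_Category Punctuation, and it uses the JDK's Character.getType to look up each character's category. The class PunctCategories and its method names are hypothetical, not part of ICU or the JDK.

```java
public class PunctCategories {
    // ASCII list that the Java documentation gives for \p{Punct}.
    static final String JAVA_PUNCT = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    // True if the character's General_Category is one of the P* categories.
    static boolean isGeneralCategoryPunctuation(char c) {
        switch (Character.getType(c)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Characters from JAVA_PUNCT whose General_Category is not Punctuation
    // (they are currency, math, or modifier symbols instead).
    static String symbolsInJavaPunct() {
        StringBuilder symbols = new StringBuilder();
        for (char c : JAVA_PUNCT.toCharArray()) {
            if (!isGeneralCategoryPunctuation(c)) {
                symbols.append(c);
            }
        }
        return symbols.toString();
    }

    public static void main(String[] args) {
        System.out.println(symbolsInJavaPunct()); // prints $+<=>^`|~
    }
}
```

The characters it reports are the ones whose category is a Symbol (S*) rather than Punctuation (P*), which matches the list ICU omits.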

Shane Carr
February 21, 2019, 10:11 PM

Looks good, thanks. When you said "docs" I had assumed you meant an API Docs change, so I was worried that I couldn't find the commit for that. Thanks for changing the ticket type to "User Guide".

Fixed

Assignee: Andy Heninger
Reporter: Victor Chang
Components:
Labels:
Reviewer: Victor Chang
Priority: major
Time Needed: None
Fix versions:
