ICU doesn't consider these characters as punctuation (punct): $+<=>^`|~

Description

The java.util.regex.Pattern documentation claims that the POSIX character class "Punct" contains the characters:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Empirically, the following characters are not considered to be in Punct on ICU:

$+<=>^`|~

Grep seems to treat them as part of [:punct:], so perhaps this is a bug in ICU? We need to figure out whether that is the case and how to fix it.

GoogleIssue: 111497078

Activity

Shane Carr
February 21, 2019, 10:11 PM

Looks good, thanks. When you said "docs" I had assumed you meant an API Docs change, so I was worried that I couldn't find the commit for that. Thanks for changing the ticket type to "User Guide".

Andy Heninger
February 21, 2019, 10:09 PM
Edited

The User Guide change is in the section https://sites.google.com/site/icuprojectuserguide/strings/regexp#TOC-Differences-with-Java-Regular-Expressions. It adds the last item in the list of differences with Java Regular Expressions:

The property expression \p{punct} differs in what it matches. Java matches any of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. From that list, ICU omits $+<=>^`|~
ICU follows the recommendations from Unicode UTS-18, http://unicode.org/reports/tr18/#Compatibility_Properties. See also https://unicode-org.atlassian.net/browse/ICU-20095
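
The difference described above can be reproduced from the Java side alone. The following is a minimal sketch, not ICU code: it assumes that the UTS-18 definition of punct corresponds to the Unicode general category Punctuation, which Java exposes as \p{P}, and uses that as a stand-in for ICU's behavior. The class name PunctDiff and the helper unmatched are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

public class PunctDiff {
    // The ASCII punctuation list from the java.util.regex.Pattern documentation.
    static final String ASCII_PUNCT = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    // Returns the characters of ASCII_PUNCT that the given pattern does not match.
    static String unmatched(Pattern p) {
        StringBuilder sb = new StringBuilder();
        for (char c : ASCII_PUNCT.toCharArray()) {
            if (!p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Java's default POSIX class: matches every character in the list.
        Pattern posix = Pattern.compile("\\p{Punct}");
        // Unicode general category Punctuation (assumed stand-in for ICU's punct).
        Pattern generalCategory = Pattern.compile("\\p{P}");
        // Same POSIX class name, but with Java's Unicode-aware definitions enabled.
        Pattern unicodeMode = Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);

        System.out.println("\\p{Punct} misses: [" + unmatched(posix) + "]");
        System.out.println("\\p{P} misses: [" + unmatched(generalCategory) + "]");
        System.out.println("\\p{Punct} (Unicode mode) misses: [" + unmatched(unicodeMode) + "]");
    }
}
```

The nine characters that \p{P} fails to match are the ones in the ticket title. They fall in the Symbol categories (currency, math, modifier) rather than Punctuation, which would explain the mismatch with the POSIX-style list.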

Shane Carr
February 21, 2019, 6:57 AM

There are no commits on this ticket. Where is the docs update?

Shane Carr
February 21, 2019, 2:28 AM

Where is the docs update? I do not see it on this issue.

Andy Heninger
January 5, 2019, 12:55 AM

Closed with a combination of fixed, for the docs update, and working as designed, for the behavior.

Fixed

Assignee

Andy Heninger

Reporter

Victor Chang

Reviewer

Victor Chang

Priority

major
