The java.util.regex.Pattern documentation claims that the POSIX character class "Punct"
contains the characters:
Empirically, the following characters are not considered in Punct on ICU.
Grep seems to treat them as part of [:punct:]. So perhaps this is a bug in ICU? We need to figure out whether that is the case and how to fix it.
GoogleIssue: 111497078
Closed with a combination of fixed, for the docs update, and working as designed, for the behavior.
Where is the docs update? I do not see it on this issue.
There are no commits on this ticket. Where is the docs update?
The Userguide change is to the section https://sites.google.com/site/icuprojectuserguide/strings/regexp#TOC-Differences-with-Java-Regular-Expressions, the addition of the last item in the list of differences with Java Regular Expressions.
The property expression \p{punct} differs in what it matches. Java matches any of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. From that list, ICU omits $+<=>^`|~
ICU follows the recommendations from Unicode UTS-18, http://unicode.org/reports/tr18/#Compatibility_Properties. See also https://unicode-org.atlassian.net/browse/ICU-20095
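For anyone who wants to reproduce the difference on the Java side, here is a minimal sketch. It assumes that Unicode general category P (written \p{P} in java.util.regex) is a reasonable stand-in for what ICU's \p{punct} matches under the UTS-18 recommendation; it does not call ICU itself. The class and method names (PunctDiff, asciiMatching, posixOnly) are hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

public class PunctDiff {
    // Collect every ASCII character (0..127) that fully matches the given regex.
    static String asciiMatching(String regex) {
        Pattern p = Pattern.compile(regex);
        StringBuilder sb = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // ASCII characters in Java's POSIX \p{Punct} that are NOT in Unicode
    // general category P (assumed here to approximate ICU's \p{punct}).
    static String posixOnly() {
        String posix = asciiMatching("\\p{Punct}");
        String unicodeP = asciiMatching("\\p{P}");
        StringBuilder sb = new StringBuilder();
        for (char c : posix.toCharArray()) {
            if (unicodeP.indexOf(c) < 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("Java \\p{Punct}: " + asciiMatching("\\p{Punct}"));
        System.out.println("Category P only: " + asciiMatching("\\p{P}"));
        System.out.println("Difference:      " + posixOnly()); // prints $+<=>^`|~
    }
}
```

The nine characters in the difference all have a Symbol general category (Sc, Sm, or Sk) rather than a Punctuation one, which is consistent with the list in the User Guide entry.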
Looks good, thanks. When you said "docs" I had assumed you meant an API Docs change, so I was worried that I couldn't find the commit for that. Thanks for changing the ticket type to "User Guide".