ICU doesn't consider these characters as punctuation (punct): $+<=>^`|~


The java.util.regex.Pattern documentation claims that the POSIX character class "Punct" contains the characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Empirically, the following characters are not considered part of Punct in ICU: $+<=>^`|~

Grep seems to treat them as part of [:punct:]. So perhaps this is a bug in ICU? We need to figure out whether that is the case and how to fix it.
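
For anyone who wants to reproduce the discrepancy with only the JDK, here is a minimal sketch. It does not call ICU at all. It assumes that the Unicode General_Category "Punctuation" (\p{P} in java.util.regex) is a reasonable stand-in for a Unicode-based definition of punct, and it compares that against Java's default POSIX-style \p{Punct}. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class PunctDiff {
    // The 32 ASCII characters that java.util.regex documents for \p{Punct}.
    static final String ASCII_PUNCT = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    // POSIX-style class as java.util.regex interprets it by default.
    static final Pattern POSIX_PUNCT = Pattern.compile("\\p{Punct}");

    // Unicode General_Category=Punctuation; used here as an assumed
    // stand-in for a Unicode-based "punct", not as ICU's actual code path.
    static final Pattern UNICODE_PUNCT = Pattern.compile("\\p{P}");

    // Returns the ASCII characters matched by \p{Punct} but not by \p{P}.
    static String onlyInPosixPunct() {
        StringBuilder sb = new StringBuilder();
        for (char c : ASCII_PUNCT.toCharArray()) {
            String s = String.valueOf(c);
            if (POSIX_PUNCT.matcher(s).matches() && !UNICODE_PUNCT.matcher(s).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the characters that Unicode classifies as symbols, not punctuation.
        System.out.println(onlyInPosixPunct());
    }
}
```

The characters it reports are the ones Unicode places in the Symbol categories (currency, math, modifier) instead of Punctuation, which is the same set listed in the title of this ticket.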

Shane Carr
February 21, 2019, 10:11 PM

Looks good, thanks. When you said "docs" I had assumed you meant an API Docs change, so I was worried that I couldn't find the commit for that. Thanks for changing the ticket type to "User Guide".

Andy Heninger
February 21, 2019, 10:09 PM

The User Guide change is the addition of the last item in the list of differences with Java Regular Expressions:

The property expression \p{punct} differs in what it matches. Java matches any of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. From that list, ICU omits $+<=>^`|~
ICU follows the recommendations from Unicode UTS-18. See also

Shane Carr
February 21, 2019, 6:57 AM

There are no commits on this ticket. Where is the docs update?

Shane Carr
February 21, 2019, 2:28 AM

Where is the docs update? I do not see it on this issue.

Andy Heninger
January 5, 2019, 12:55 AM

Closed with a combination of fixed, for the docs update, and working as designed, for the behavior.



