Provide a RegexMatcher::find overload taking end position
Description
Please implement a RegexMatcher::find overload that takes both a start and an end position, in addition to the existing overload that takes only a start position.
Consider this use case: an application (LibreOffice) provides an extended Find & Replace feature that lets the user search with a regex in text that has a specified set of attributes. The program first finds the text runs with the specified attributes (say, bold), then calls ICU's RegexMatcher methods to match the given regular expression.
If LibreOffice passes only the text run that matches the attributes to RegexMatcher, both false positives and false negatives can occur. E.g., for this text:
A short paragraph with bold text in it.
and a search for the regex t$ restricted to bold text: if LibreOffice cuts out and passes only the text with the required attributes to RegexMatcher (in this case, "bold text"), the trailing "t" is matched even though it is not the last character in the paragraph: a false positive. Conversely, a regex with a look-ahead assertion produces a false negative: a search for bold text using the regex t(?= in) finds nothing, because the trailing text with the "incorrect" attribute set is not part of the string passed to RegexMatcher.
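The two failure modes of cutting out the run can be reproduced in a few lines. This is only a sketch: it uses Java's java.util.regex rather than ICU's RegexMatcher, and the class name CutRunDemo and helper foundIn are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

// Illustration only: java.util.regex stands in for ICU's RegexMatcher.
class CutRunDemo {
    static final String PARAGRAPH = "A short paragraph with bold text in it.";
    // The run that carries the "bold" attribute, cut out of the paragraph.
    static final String BOLD_RUN = "bold text";

    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean foundIn(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // False positive: the cut-out run ends in "t", so "$" matches there,
        // although the paragraph itself continues after the run.
        System.out.println(foundIn("t$", BOLD_RUN));        // true
        // False negative: the look-ahead cannot see " in" after the run.
        System.out.println(foundIn("t(?= in)", BOLD_RUN));  // false
    }
}
```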
On the other hand, passing the whole text to the engine naturally produces other wrong matches. Say, in a search for bold text with the regex .+\s, passing the whole "A short paragraph with bold text in it." to RegexMatcher and specifying "b" as the starting point yields "bold text in ", whereas it should have been "bold ".
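The overshoot with a start-only search can be sketched the same way. Again this uses Java's java.util.regex as a stand-in for ICU, whose find overload with a start position behaves analogously; WholeTextDemo and matchFrom are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: java.util.regex stands in for ICU's RegexMatcher.
class WholeTextDemo {
    static final String PARAGRAPH = "A short paragraph with bold text in it.";

    // Hypothetical helper: search the whole text, beginning at `start`,
    // with no upper bound on where the match may end.
    static String matchFrom(String regex, String text, int start) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find(start) ? m.group() : null;
    }

    public static void main(String[] args) {
        int boldStart = PARAGRAPH.indexOf("bold");
        // The greedy ".+" runs past the end of the bold run, so the match
        // extends into text that does not carry the attribute.
        System.out.println("[" + matchFrom(".+\\s", PARAGRAPH, boldStart) + "]");
        // prints "[bold text in ]"
    }
}
```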
This could be solved by an overload of RegexMatcher::find that takes not only a start position but also an end position. The engine could then stop correctly at the end position while still having enough context to evaluate look-ahead assertions or paragraph ends.
Activity
Markus Scherer March 4, 2020 at 11:44 PM
Reporter found existing API for what they need.
Mike Kaganski February 28, 2020 at 7:13 AM
Oh, obviously this is INVALID and is only a sign of me being ignorant. Just using region(), which has been there for ~ever, solves the use case. I don't know how to close this as INVALID; sorry for the noise!
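For readers landing on this ticket, here is a rough sketch of how a region can cover all three cases from the description. It is written against Java's java.util.regex.Matcher, not ICU. ICU's RegexMatcher offers region(), useTransparentBounds() and useAnchoringBounds() methods with analogous semantics, though the signatures differ (the ICU region() also takes a UErrorCode). The names RegionDemo and findInRegion are hypothetical, and the comment above does not say which bounds settings LibreOffice ended up using, so the two settings below are an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: java.util.regex stands in for ICU's RegexMatcher.
class RegionDemo {
    static final String PARAGRAPH = "A short paragraph with bold text in it.";

    // Hypothetical helper: match only inside [start, end), but keep the
    // surrounding text visible to look-arounds and anchors.
    static String findInRegion(String regex, String text, int start, int end) {
        Matcher m = Pattern.compile(regex).matcher(text);
        m.region(start, end);
        m.useTransparentBounds(true);  // look-ahead/behind may see past the region
        m.useAnchoringBounds(false);   // ^ and $ do not match at the region edges
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        int start = PARAGRAPH.indexOf("bold text");
        int end = start + "bold text".length();
        System.out.println(findInRegion("t$", PARAGRAPH, start, end));       // null
        System.out.println(findInRegion("t(?= in)", PARAGRAPH, start, end)); // t
        System.out.println("[" + findInRegion(".+\\s", PARAGRAPH, start, end) + "]"); // [bold ]
    }
}
```

With the default settings (anchoring bounds on, transparent bounds off), the region would behave like the cut-out run from the description, so both settings matter for this use case.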