Add some processing details to ubrk/brkiter documentation
Description
There are a few processing details which would be useful to add to the ubrk/brkiter documentation:
1. Cursors that point into the middle of a segment are supported and expected.
2. If the cursor points into the middle of a surrogate pair, next()/following()/previous()/preceding() will move the cursor to the beginning of the surrogate pair before any other processing.
3. If the cursor then points into the middle of a segment, next()/following() will move to the end of that segment, and previous()/preceding() will move to the beginning of that segment.
4. Otherwise, the cursor points to a segment boundary. next()/following() will move to the end of the next segment, and previous()/preceding() will move to the beginning of the previous segment.
5. There is at least one situation where next()/following()/previous()/preceding() may scan an unbounded number of characters before the cursor (far beyond, even, the distance between the current cursor and the returned cursor). One example is a long string of flag emoji. In order to know whether the cursor is in the middle of a single flag or between adjacent flags, the implementation has to count how many regional indicator symbols occur in the string before the cursor, to determine if the count is odd or even.
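To make items 2 and 5 concrete, here is a minimal Java sketch of the two checks they describe: snapping an index off the middle of a surrogate pair, and counting the regional indicator symbols before the cursor to see whether the count is odd or even. This is not ICU's implementation. The class name FlagBoundarySketch and all of its method names are hypothetical, it uses only java.lang.Character, and it deliberately ignores every other grapheme cluster rule.

```java
final class FlagBoundarySketch {
    private FlagBoundarySketch() {}

    // Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    static boolean isRegionalIndicator(int codePoint) {
        return codePoint >= 0x1F1E6 && codePoint <= 0x1F1FF;
    }

    // Item 2: if index sits between the two halves of a surrogate pair,
    // move it back to the start of that pair; otherwise leave it alone.
    static int snapToCodePointStart(CharSequence text, int index) {
        if (index > 0 && index < text.length()
                && Character.isLowSurrogate(text.charAt(index))
                && Character.isHighSurrogate(text.charAt(index - 1))) {
            return index - 1;
        }
        return index;
    }

    // Item 5: count the contiguous run of regional indicators that ends
    // just before index. The loop can walk all the way back to the start
    // of the text, which is the unbounded scan described above.
    static int countRegionalIndicatorsBefore(CharSequence text, int index) {
        int count = 0;
        int i = index;
        while (i > 0) {
            int cp = Character.codePointBefore(text, i);
            if (!isRegionalIndicator(cp)) {
                break;
            }
            count++;
            i -= Character.charCount(cp);
        }
        return count;
    }

    // True if index (after snapping) falls between the two regional
    // indicators of one flag: the next code point is a regional indicator
    // and an odd number of them immediately precede the index.
    static boolean splitsFlag(CharSequence text, int index) {
        int i = snapToCodePointStart(text, index);
        if (i <= 0 || i >= text.length()) {
            return false;
        }
        if (!isRegionalIndicator(Character.codePointAt(text, i))) {
            return false;
        }
        return countRegionalIndicatorsBefore(text, i) % 2 == 1;
    }

    public static void main(String[] args) {
        // Two flags: U+1F1FA U+1F1F8 followed by U+1F1EB U+1F1F7 (8 UTF-16 code units).
        String flags = "\uD83C\uDDFA\uD83C\uDDF8\uD83C\uDDEB\uD83C\uDDF7";
        for (int i = 0; i <= flags.length(); i++) {
            System.out.println(i + " -> snapped " + snapToCodePointStart(flags, i)
                    + ", splits a flag: " + splitsFlag(flags, i));
        }
    }
}
```

In this sketch, index 4 lies between the two flags (two regional indicators precede it), while indices 2 and 6 each fall inside a flag (one and three precede them). Deciding that for an index near the end of a long run of flags requires walking back over the whole run.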
Activity
Myles C. Maxfield
January 19, 2022 at 8:57 PM
Yes, of course. I’ll try to get to it this week.
Markus Scherer
January 19, 2022 at 8:26 PM
Thanks, @Myles C. Maxfield – would you be willing to send us a pull request, ideally for both C++ and Java API docs, or alternatively for the User Guide?
Myles C. Maxfield
January 14, 2022 at 6:41 PM
I guess it might be valuable to mention which segment space characters belong to.
Myles C. Maxfield
January 14, 2022 at 6:40 PM
A couple follow-ups:
If the cursor points between two segments, those two segments are usually called the “previous & current segments”, not the “previous & next segments.”
I should also describe what it means for the cursor to point between two segments, when the value returned is a string index. How can a string index point in between string elements?
I should say that jumping to the beginning/end of a cluster from the middle is always well-defined.
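One way to picture an index that points between string elements: index i names the gap just before element i, so a boundary at 5 sits between elements 4 and 5. The sketch below illustrates that with the JDK's java.text.BreakIterator as a stand-in, on the assumption that it is close enough for illustration. ICU4J's com.ibm.icu.text.BreakIterator exposes the same first()/next()/following()/preceding() method names, but its results are not guaranteed to match the JDK's in every case. The class name BoundaryIndexSketch and its helper methods are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BoundaryIndexSketch {
    private BoundaryIndexSketch() {}

    // Collects every boundary index a JDK word iterator reports for text.
    static List<Integer> boundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    // First boundary strictly after offset, or BreakIterator.DONE.
    static int following(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        return it.following(offset);
    }

    // Last boundary strictly before offset, or BreakIterator.DONE.
    static int preceding(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        return it.preceding(offset);
    }

    public static void main(String[] args) {
        String text = "Hello world";
        System.out.println(boundaries(text));   // [0, 5, 6, 11]
        System.out.println(following(text, 2)); // 5: end of the segment containing index 2
        System.out.println(preceding(text, 2)); // 0: start of that same segment
        System.out.println(following(text, 5)); // 6: end of the next segment
        System.out.println(preceding(text, 5)); // 0: start of the previous segment
    }
}
```

With the JDK word iterator, the space in "Hello world" is a segment of its own, spanning indices 5 to 6, which is one possible answer to the question of which segment space characters belong to. Whether the ubrk/brkiter documentation should describe it that way is a separate decision.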