wordbreaker (\b) not working with regular expressions in List-UnicodeSet tool

Description

Deleted Component: utilities-online

On http://cldr.unicode.org/unicode-utilities/list-unicodeset
the document says that \b (word break) is supported for the character name and block name properties (and it gives examples of these).

However, \b only matches at the start and end of the property value string (a sort of combination of the caret and dollar metacharacters). Either there is still no actual support for matching word breaks, or the word breaker implemented in the tool is incorrectly instantiated or not implemented at all (just a stub), even though it should be simple here, since the values of these properties are restricted to ASCII (Basic Latin) characters.
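For comparison, a conventional regex engine treats \b as a zero-width assertion between a word character and a non-word character, anywhere in the string. The following is a minimal sketch using Python's re module, not the tool's own engine; the sample value and the helper name `found` are illustrative, and writing the block name with spaces is an assumption about how the tool presents property values.

```python
import re

# Hypothetical sample value: a block name written with spaces (an assumption).
VALUE = "Latin Extended Additional"


def found(pattern, value):
    """Return True if the pattern matches anywhere in the value."""
    return re.search(pattern, value) is not None


# \b matches at any word/non-word transition, including mid-string.
print(found(r"\bExtended\b", VALUE))  # True

# ^ and $ only match at the two ends of the whole value.
print(found(r"^Extended$", VALUE))    # False
```

If the tool's \b really behaved like ^ and $, the first query would fail in the same way as the second.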

See the tests with the word "Latin", which occurs in several block names (always exactly at word boundaries):

(1) basic test (ignoring word boundaries):

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:/Latin/}

1104 code points returned. No problem here.

(2) tests with boundary at end of the query:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:/Latin\b/}

Only 128 code points returned, i.e. those in the "Basic Latin" (ASCII) block only, and none from the other Latin blocks (the Latin extension blocks). This behaves exactly like:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:/Latin$/}

i.e. \b matches only at the end of the property value.

(3) tests with boundary at beginning of the query:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:/\bLatin/}

Only 976 code points returned, i.e. those in all the Latin extension blocks, but not the "Basic Latin" (ASCII) block. This behaves exactly like:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:/^Latin/}

i.e. \b matches here only at the beginning of the property value.

(4) tests with boundary at beginning and end of the query:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:/\bLatin\b/}

No code points found! This behaves exactly like:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:/^Latin$/}

or

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:Latin}

and it returns nothing because there is no block named just "Latin".
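The behaviour expected from tests (1) to (4) can be sketched outside the tool. This example runs the same patterns with Python's re module over a small, illustrative subset of block names; the list `SAMPLE_BLOCKS` and the helper `matching_blocks` are hypothetical names, the spaced spelling of the block names is an assumption, and no code point counts are implied.

```python
import re

# Illustrative subset of Unicode block names, written with spaces (an assumption).
SAMPLE_BLOCKS = [
    "Basic Latin",
    "Latin-1 Supplement",
    "Latin Extended-A",
    "Latin Extended-B",
    "Latin Extended Additional",
    "Greek and Coptic",
]


def matching_blocks(pattern, blocks=SAMPLE_BLOCKS):
    """Return the block names in which the pattern matches somewhere."""
    rx = re.compile(pattern)
    return [name for name in blocks if rx.search(name)]


# With real word-boundary support, all four \b variants select the same blocks
# as the plain /Latin/ query; only the anchored form selects nothing.
for pattern in (r"Latin", r"Latin\b", r"\bLatin", r"\bLatin\b", r"^Latin$"):
    print(pattern, matching_blocks(pattern))
```

In this sketch every pattern except `^Latin$` selects the five Latin blocks, which is the result the report expects from the tool for tests (2), (3) and (4).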

Environment

xpath: None
locale: None

Status

Assignee: Mark Davis
Reporter: TracBot
tracReporter: verdy_p@abeef3a88dc95339
tracOwner: mark
tracResolution: moved-to-unicodetools
tracStatus: closed
tracCreated: Jun 07, 2011, 7:09 AM
Fix versions:
Priority: medium