  1. CLDR-3962

wordbreaker (\b) not working with regular expressions in List-UnicodeSet tool

    Details

    • Type: Bug
    • Status: Done
    • Priority: medium
    • Resolution: Unresolved
    • Affects versions: None
    • Fix versions: cleanup
    • Components: None
    • Labels:
      None

      Description

      Deleted Component: utilities-online

      On http://cldr.unicode.org/unicode-utilities/list-unicodeset
      the document says that \b (word break) is supported for the character name and block name properties (and it gives examples of these).

      However, \b only matches at the start and end of the property value string (a sort of combination of the caret and dollar metacharacters). There is still no actual support for matching word breaks: either the word breaker implemented in the tool is incorrectly instantiated, or it is not implemented at all (just a stub), even though it should be simple here for such properties, whose values are restricted to ASCII and Basic Latin letters.
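To illustrate why a word breaker should be simple for ASCII-only property values, here is a minimal sketch of the conventional regex definition of \b: a position where exactly one of the two neighbouring characters is a word character. The function name `ascii_word_boundaries` and the constant `WORD_CHARS` are hypothetical and are not taken from the tool's source.

```python
import string

# Assumption: "word character" means ASCII letters, digits and underscore,
# which is the conventional regex \w restricted to ASCII.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def ascii_word_boundaries(value):
    """Return the offsets in `value` where \\b would match.

    An offset is a boundary when exactly one of the characters on either
    side of it is a word character (the string edges count as non-word).
    """
    boundaries = []
    for i in range(len(value) + 1):
        before = i > 0 and value[i - 1] in WORD_CHARS
        after = i < len(value) and value[i] in WORD_CHARS
        if before != after:
            boundaries.append(i)
    return boundaries


if __name__ == "__main__":
    for name in ("Basic Latin", "Latin-1 Supplement"):
        print(name, ascii_word_boundaries(name))
```

Under this definition, "Basic Latin" has boundaries on both sides of "Latin" (offsets 6 and 11), not just at the ends of the whole string.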

      See the tests with the word "Latin", which occurs in several block names (always exactly at word boundaries):

      (1) basic test (ignoring word boundaries) :

      http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p

      {Block:/Latin/}

      1104 code points returned. No problem here.

      (2) tests with boundary at end of the query:

      http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p

      {Block:/Latin\b/}

      Only 128 code points are returned, i.e. those in the "Basic Latin" (ASCII) block, but none from the other Latin extension blocks. This behaves exactly like:

      http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p

      {Block:/Latin$/}

      i.e. \b matches only at the end of the property value.

      (3) tests with boundary at beginning of the query:

      http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p

      {Block:/\bLatin/}

      Only 976 code points are returned, i.e. those in all the Latin extension blocks, but not the "Basic Latin" (ASCII) block. This behaves exactly like:

      http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p

      {Block:/^Latin/}

      i.e. \b matches here only at the beginning of the property value.

      (4) tests with boundary at beginning and end of the query:

      http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p

      {Block:/\bLatin\b/}

      No code points found! This behaves exactly like:

      http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p

      {Block:/^Latin$/}

      or

      http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p

      {Block:Latin}

      and it returns nothing because there is no block named just "Latin".
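For comparison, the four tests above can be reproduced against a conventional regex engine. The sketch below uses Python's `re` module, not the tool's own implementation, and a small hand-picked sample of block names spelled with spaces as in this report; the list `SAMPLE_BLOCKS` and the helper `matching_blocks` are hypothetical names introduced only for illustration.

```python
import re

# Assumption: a hand-picked sample of block names, spelled with spaces
# as they are quoted in this report. This is not the full block list.
SAMPLE_BLOCKS = [
    "Basic Latin",
    "Latin-1 Supplement",
    "Latin Extended-A",
    "Latin Extended Additional",
    "Greek and Coptic",
]


def matching_blocks(pattern):
    """Return the sample block names in which `pattern` is found anywhere."""
    regex = re.compile(pattern)
    return [name for name in SAMPLE_BLOCKS if regex.search(name)]


if __name__ == "__main__":
    for pattern in (r"Latin", r"Latin\b", r"\bLatin", r"\bLatin\b",
                    r"Latin$", r"^Latin", r"^Latin$"):
        print(pattern, "->", matching_blocks(pattern))
```

With these space-separated names, the three \b patterns select the same four Latin blocks as the plain /Latin/ query, while the anchored patterns reproduce the narrower results reported above. Note that \w, and therefore \b, treats the underscore as a word character in Python's `re`, so the outcome would differ for values spelled with underscores instead of spaces.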

            People

            • Assignee:
              mark.edward.davis Mark Davis
              Reporter:
              apibot TracBot