Word Boundaries

Description

The XML Query group is looking at word boundaries (be careful to distinguish
these from line-break boundaries). One of the members looked at the ICU4J rules
and had some comments. There are some oddities that we need to address; for
example, the handling of the quotation mark definitely looks like a bug. (I
don't agree with all of what he says, however.)

In fixing these, there is another issue: right now we would not have any breaks
within Chinese, or Thai, and hardly any in Japanese. While a full solution would
require more work, I think a better default behavior would be:

  • break ideographs into grapheme clusters

  • invoke the Thai line break if we ever get a Thai character.
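
To make the proposed default concrete, here is a rough sketch in Python rather than in the ICU4J rule syntax. Everything in it is an assumption for illustration: the names `kind` and `default_breaks` are hypothetical, "ideograph" is approximated as "CJK unified ideograph", a grapheme cluster is approximated as a base character plus any following combining marks, and `thai_segmenter` stands in for whatever Thai break logic would actually be invoked. Other text is left as undivided runs, where the ordinary word rules would apply.

```python
import unicodedata

def kind(ch):
    # Approximation: only CJK unified ideographs count as "ideographs" here.
    if unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH"):
        return "han"
    # Thai block, U+0E00..U+0E7F.
    if "\u0e00" <= ch <= "\u0e7f":
        return "thai"
    return "other"

def default_breaks(text, thai_segmenter=None):
    """Break after every ideograph cluster; hand Thai runs to a segmenter."""
    tokens, run, run_kind = [], "", None

    def flush():
        nonlocal run
        if not run:
            return
        if run_kind == "thai" and thai_segmenter is not None:
            # Hypothetical hook: delegate the whole Thai run.
            tokens.extend(thai_segmenter(run))
        else:
            # The ordinary word rules would be applied to this run.
            tokens.append(run)
        run = ""

    for ch in text:
        # Keep combining marks with the preceding character (crude cluster).
        if unicodedata.category(ch).startswith("M") and (run or tokens):
            if run:
                run += ch
            else:
                tokens[-1] += ch
            continue
        k = kind(ch)
        if k == "han":
            flush()
            tokens.append(ch)      # one token per ideograph
            run_kind = None
        else:
            if k != run_kind:
                flush()            # flush the old run under its old kind
                run_kind = k
            run += ch
    flush()
    return tokens
```

With no segmenter supplied, a Thai run stays in one piece, which matches the current behavior; the point of the hook is only to show where the Thai logic would be called.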

============
Message-ID: <4EDD23A3F6B4D411B7DF00A0C9DD5B560166E6CE@MSGBOS626NTS.fmr.com>
From: "Tolkin, Steve" <Steve.Tolkin@FMR.COM>
To: w3c-query-operators@w3.org
Date: Thu, 10 Jan 2002 17:38:22 -0500
Subject: RE: FTTF Agenda Item 4, Review of Unicode boundaries

The ICU4J rules for word break are not too bad.
I like the rule that cannot have more than one punctuation mark in a
row; this solves the problems of – and ... .
But here are some disagreements:
1. Why allow quotation mark in a word?
2. Why allow quotation mark in a number?
3. Since it allows # and $ as the beginning part of a number, why not also
allow - and + ?
4. I would also like to allow / in the middle of a number, both to
support fractions e.g. 1/2 and also dates (a kind of numeric type)
e.g. 12/25/99
5. I might also like to allow the HYPHEN-MINUS in the middle of a
number, to support mostly numeric identifiers such as ISBN (although
that can have a last character that is a letter), Social security
numbers, phone numbers, etc. The biggest problem with breaking these
into short strings at - is that some will be so short, e.g. only 1
character for last and often first in ISBN, that those might be
discarded as too short. (Like stop words, but determined by length.)
AltaVista is good in this area – it indexes everything. But most
systems don't.
6. There probably needs to be a better way to control whether a period
at the end of an abbreviation is attached to the word or split off.
The current implementation always splits this off, and produces tokens
such as "e.g" from the input "e.g.".
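
As an illustration of what suggestions 3 through 6 would mean for tokenization, the following is a minimal sketch using Python regular expressions. It is not the ICU4J rule set and not a proposal for the actual rules; the pattern names and helper functions are hypothetical, and the patterns cover only ASCII digits and letters.

```python
import re

# A number may start with '#', '$', '+' or '-' (suggestion 3) and may
# contain '/' or HYPHEN-MINUS between digit groups (suggestions 4 and 5).
NUMBER = re.compile(r"[#$+-]?\d+(?:[/-]\d+)*")

# Two or more letter-plus-period pairs are kept together, so the final
# period of an abbreviation stays attached (suggestion 6).
ABBREVIATION = re.compile(r"(?:[A-Za-z]\.){2,}")

def number_tokens(text):
    """Return the number-like tokens found in text."""
    return NUMBER.findall(text)

def abbreviation_tokens(text):
    """Return abbreviation tokens, trailing period included."""
    return ABBREVIATION.findall(text)
```

For example, `number_tokens("12/25/99")` yields the single token `12/25/99` instead of three short pieces. An ISBN whose last character is a letter is not fully covered by this pattern, which is the complication noted in item 5.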

Assignee

Andy Heninger

Reporter

TracBot

Priority

major