collation element iterator needs to return quaternary-level data

Description

The collation element iterator currently returns primary+secondary+case+tertiary
level data. It also needs to return quaternary-level data (for shifted and
Hiragana). Add a function that returns all the bits. Reserve some bits for
further, future levels.
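
A minimal sketch of what "a function that returns all the bits" could look like, assuming a 64-bit value with one field per level and a block of reserved bits at the low end. The field widths and the names `FIELDS`, `Weights`, `pack`, and `unpack` are hypothetical and are not ICU's actual layout or API.

```python
from typing import NamedTuple

# Hypothetical field widths, most significant field first (64 bits total).
# This is an illustration only, not ICU's collation element format.
FIELDS = (
    ("primary", 24),
    ("secondary", 8),
    ("case_bits", 2),
    ("tertiary", 6),
    ("quaternary", 16),
    ("reserved", 8),  # always zero, held back for further, future levels
)


class Weights(NamedTuple):
    primary: int
    secondary: int
    case_bits: int
    tertiary: int
    quaternary: int


def pack(w: Weights) -> int:
    """Pack all levels into one integer, leaving the reserved bits zero."""
    value = 0
    for name, width in FIELDS:
        field = 0 if name == "reserved" else getattr(w, name)
        if field >> width:
            raise ValueError(f"{name} weight does not fit in {width} bits")
        value = (value << width) | field
    return value


def unpack(value: int) -> Weights:
    """Split a packed integer back into its per-level weights."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = value & ((1 << width) - 1)
        value >>= width
    fields.pop("reserved")
    return Weights(**fields)
```

Returning a record such as `Weights` rather than the packed integer would keep callers independent of the bit layout, so the reserved field could later be assigned to a new level without an API change.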

No need to return the identical level because that is just the NFD form of the
original string.

Activity

TracBot
June 30, 2018, 11:50 PM
Trac Comment by auditor—1970-01-01T01:28:03.000Z
  • 12/24/02 17:45:37 hshih changed notes2

  • 04/14/03 19:09:12 ram changed notes2

  • 05/30/03 11:34:49 hshih changed notes2

  • 06/16/03 18:03:03 hshih changed notes2

  • 02/05/04 19:57:23 weiv changed notes2

  • 02/06/04 16:31:14 weiv changed notes2

  • 02/17/04 03:32:57 weiv changed notes2

  • Tue Sep 27 10:12:31 2005 weiv changed notes2: target: "3.0" to "UNSCH", xref: "3536" to "3536 4782", comments: (blank lines) to ""

TracBot
June 30, 2018, 11:50 PM
Trac Comment 2 by —2012-06-01T15:49:14.600Z

Setting priority=zero because in ten years no one seems to have found a need for this. The collation element iterator is used in string search (e.g., web browser ctrl-F in-page search), and the tendency there is towards ignoring lower-level differences.

TracBot
June 30, 2018, 11:50 PM
Trac Comment 3 by —2012-07-18T23:10:40.987Z

The collation element iterator should also do at least some pre-processing according to the Collator attributes, e.g., alternate=shifted blanking levels 1..3, upperFirst inverting the case weights.
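
The two examples named above can be sketched per collation element as follows. This assumes a simple record type `CE`, a `variable_top` threshold on primary weights, a maximum quaternary weight of 0xFFFF, and case weights encoded as 1 (lower), 2 (mixed), 3 (upper) with 0 meaning none. All of these names and encodings are hypothetical, not ICU's. Primary ignorables are passed through unchanged here because their treatment depends on the preceding CE.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CE:
    primary: int
    secondary: int
    case_weight: int  # hypothetical: 0 = none, 1 = lower, 2 = mixed, 3 = upper
    tertiary: int
    quaternary: int = 0


def preprocess(ce, *, variable_top, shifted=False, upper_first=False):
    """Apply attribute-dependent adjustments to a single CE."""
    if upper_first and ce.case_weight:
        # Invert the case weight so that upper sorts before lower.
        ce = replace(ce, case_weight=4 - ce.case_weight)
    if shifted:
        if 0 < ce.primary <= variable_top:
            # Variable CE: blank levels 1..3, move the primary to level 4.
            return CE(0, 0, 0, 0, quaternary=ce.primary)
        if ce.primary:
            # Regular CE: give it the highest quaternary weight.
            ce = replace(ce, quaternary=0xFFFF)
    return ce
```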

TracBot
June 30, 2018, 11:50 PM
Trac Comment 4 by —2012-07-24T00:49:35.657Z

This would also be useful because an API that returns (partially processed) weights would make it easier to change the bit fields of a collation element integer. We should deprecate the API that returns 32-bit integers.

TracBot
June 30, 2018, 11:50 PM
Trac Comment 3.5 by —2012-09-17T22:01:34.408Z

Replying to Comment 3 (markus):

"... some pre-processing according to the Collator attributes, e.g., alternate=shifted blanking levels 1..3, ..."

Complication: With alternate=shifted, primary ignorables become completely ignorable after a shifted CE. When iterating backwards and we get a primary ignorable, we will have to iterate further until we get the next primary CE, buffer the intervening primary-ignorable CEs with their source indexes, and discard them if that primary CE is variable. We could leave this to the caller, but then every caller would have to duplicate this processing.
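
A rough sketch of that buffering, assuming the CEs are already available as a list of `(primary, secondary, tertiary)` tuples and that a hypothetical `variable_top` threshold decides which primaries are variable. The function name `backward_shifted` is made up for illustration. The variable CE itself is yielded unchanged; a fuller version would also shift its weights.

```python
def backward_shifted(ces, variable_top):
    """Yield (source_index, ce) from last to first, dropping primary
    ignorables that follow a variable CE in text order."""
    pending = []  # primary ignorables seen since the last primary CE
    for index in range(len(ces) - 1, -1, -1):
        ce = ces[index]
        primary = ce[0]
        if primary == 0:
            # Cannot decide yet: depends on the next primary CE going backwards.
            pending.append((index, ce))
            continue
        if primary > variable_top:
            # Preceded by a regular CE: the buffered ignorables are kept.
            yield from pending
        # Preceded by a variable CE: the buffered ignorables are discarded.
        pending.clear()
        yield index, ce
    # Start of text reached with no variable CE before the buffered ignorables.
    yield from pending
```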

The Boyer-Moore String Search implementation might go away or might be redone. We might be able to limit a new collation iterator API to only iterating forward, or only returning primary weights when going backwards. It may be even better to encapsulate the key/pattern in a class that holds the internal CE representation and has a "matchesAt" function that checks for a match from a starting point in the text. If we can limit backward iteration to reading collation grapheme clusters and returning whether there is any primary weight, we might not need to expose CEs and weights on a new API at all.
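
To illustrate the "matchesAt" idea, here is a toy sketch in which the pattern object keeps its internal keys private and only answers whether a match starts at a given position. The names `toy_primaries`, `PrimaryPattern`, and `matches_at` are hypothetical, and `toy_primaries` is a stand-in for real CE lookup: it treats the casefolded base character as the primary key and combining marks as ignorable, which is not real collation data.

```python
import unicodedata


def toy_primaries(ch):
    """Toy stand-in for CE lookup: one key per base character, none for
    combining marks. Not real collation data."""
    return [c.casefold() for c in unicodedata.normalize("NFD", ch)
            if not unicodedata.combining(c)]


class PrimaryPattern:
    """Holds the pattern's internal key sequence; callers never see it."""

    def __init__(self, pattern):
        self._keys = [k for ch in pattern for k in toy_primaries(ch)]

    def matches_at(self, text, start):
        """Return the end index of a match beginning at start, or -1."""
        pos = 0
        i = start
        while pos < len(self._keys):
            if i >= len(text):
                return -1
            for key in toy_primaries(text[i]):
                if pos >= len(self._keys) or key != self._keys[pos]:
                    return -1
                pos += 1
            i += 1
        # Extend the match over trailing ignorables.
        while i < len(text) and not toy_primaries(text[i]):
            i += 1
        return i
```

With this shape the caller only drives forward scanning over the text, so no CE or weight values would need to appear on the public API.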

Assignee

weivsara@gmail.com

Reporter

Markus Scherer

Components

Labels

None

Reviewer

None

Priority

medium

Time Needed

Weeks

Fix versions

None