Enhancements for collation

Description

(This was originally reported by Åke Persson; I summarize and lay out the
possible options below.)

Traditional Indic collation requires a large tailoring, because of the general
behavior of Virama. Let X and Yn be primary weights: X for the consonant K, and
Yn for the independent vowels. Then what you need (eventually) is:

KA+VIRAMA => X
KA+I-MATRA => X, Y1
KA+U-MATRA => X, Y2
...
KA {not followed by one of the above} => X, Y0
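
The mapping above can be sketched in a few lines of Python. This is only a toy
illustration, not ICU or CLDR code: the weights are the symbolic names X, Y0,
Y1, Y2 from the table, only KA with the I and U matras is covered, and the
function name and the placeholder weight for other characters are assumptions
made for the example.

```python
KA = "\u0915"        # DEVANAGARI LETTER KA
VIRAMA = "\u094D"    # DEVANAGARI SIGN VIRAMA
# Symbolic vowel weights for two matras (assumed names from the table above).
MATRA_WEIGHTS = {"\u093F": "Y1",   # VOWEL SIGN I
                 "\u0941": "Y2"}   # VOWEL SIGN U


def primary_weights(text):
    """Return the symbolic primary weights for a string (toy model)."""
    weights = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == KA:
            weights.append("X")
            if nxt == VIRAMA:
                i += 2                      # KA+VIRAMA => X
            elif nxt in MATRA_WEIGHTS:
                weights.append(MATRA_WEIGHTS[nxt])
                i += 2                      # KA+matra => X, Yn
            else:
                weights.append("Y0")        # inherent vowel => X, Y0
                i += 1
        else:
            # Placeholder weight for anything outside the toy repertoire.
            weights.append(f"<{ord(ch):04X}>")
            i += 1
    return weights
```

Note that the last branch is the one that depends on what follows KA, which is
the case the rest of this ticket is about.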

All of the above can be handled by UCA except for the last line. Handling the
last line requires formulating a lot of rules. The situation is a bit similar to
the tailoring rules we have in CLDR, which allow the previous context to affect
the choice of weights. In this case, it is the following context that is
important.

It would be possible to extend the CLDR rules, so that the matras could just
have the weights Y1, Y2, ..., and one would only need 2 rules, one for KA {not
followed by matras}, and one for KA+VIRAMA.

We currently have context before, like:

<reset>a</reset>
<x><context>a</context><s>-</s></x>

We could also add context afterwards, so that we could have a rule like:

<reset>क्अ</reset>
<x><p>क</p><context>[ि-ौ \u094D |^]</context></x>

The interpretation would be that unless the context is matched, the above rule
is not invoked. So this puts क after क्अ (at a tertiary level), but only
if the क is not followed by any of the characters ि through ौ or virama.
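
As a rough sketch of that interpretation, the check could be written as a
lookahead that never consumes the following character. The function name, the
return shape, and the decision to treat end of string as "not followed by an
excluded character" are assumptions for illustration, not part of the proposal.

```python
KA = "\u0915"        # DEVANAGARI LETTER KA
VIRAMA = "\u094D"    # DEVANAGARI SIGN VIRAMA
# Vowel signs I (U+093F) through AU (U+094C), plus virama.
EXCLUDED = {chr(cp) for cp in range(0x093F, 0x094C + 1)} | {VIRAMA}


def match_with_following_context(text, i, target=KA, excluded=EXCLUDED):
    """Return (matched, next_index) for a rule with a following context.

    The rule matches when text[i] is the target and the next character is
    not in the excluded set. The next character is only inspected: on a
    match, next_index is i + 1, so the follower is still processed normally.
    """
    if i >= len(text) or text[i] != target:
        return (False, i)
    follower = text[i + 1:i + 2]   # empty string at end of text
    if follower and follower in excluded:
        return (False, i)
    return (True, i + 1)
```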

There are three possible options.
A. Leave everything the way it is, so the collation developer puts in all the
combinations.
B. Add the syntax to CLDR. Implementations that use CLDR data would have two
choices.
B1. Expand this rule into the longer list when generating the implementation's
data. E.g., this could be done for ICU in the CLDR2ICU converter. (It would
expand out क + each of the characters not in the context into a separate rule,
as above.)
B2. Adapt to handle the syntax natively. The implementation could be like
what is done in ICU for contractions currently, except that the follow-on tests
would not actually absorb the characters.
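
To make option B1 concrete, here is a minimal sketch of the expansion step. It
is not the CLDR2ICU converter: the function name is hypothetical, and the
repertoire of possible followers is assumed to be the code point range
U+0900..U+097F purely as a stand-in for whatever set a real converter would use.

```python
KA = "\u0915"        # DEVANAGARI LETTER KA
VIRAMA = "\u094D"    # DEVANAGARI SIGN VIRAMA
# Characters named in the context: vowel signs I..AU and virama.
EXCLUDED = {chr(cp) for cp in range(0x093F, 0x094C + 1)} | {VIRAMA}
# Assumed repertoire of possible followers (stand-in only).
REPERTOIRE = [chr(cp) for cp in range(0x0900, 0x0980)]


def expand_rule(base, excluded, repertoire):
    """List every base+follower sequence that needs its own explicit rule."""
    return [base + ch for ch in repertoire if ch not in excluded]


expanded = expand_rule(KA, EXCLUDED, REPERTOIRE)
```

Each entry in `expanded` would become one explicit rule in the generated data.
The sketch leaves out KA at the end of a string, which a real expansion would
also have to cover.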

xpath

None

locale

None

Status

Priority

assess

Assignee

Markus Scherer

Reporter

Mark Davis

tracReporter

mark

Reviewer

None

Labels

Components

Fix versions

None

phase

None