Add Collation "search" variant for Korean

Description

Deleted Component: other

Jungshik and I were talking about searching, and it looks like we should add an
additional collator designed for searching Korean, where the input is directly
from the user (e.g. Find in page, autocompletion, etc.).

We just tried it out, and it looks like a set of the following will produce the
right results:

&ᄀ = ᆨ
&ᄀᄀ = ᆩ = ᄁ
...

That is,

  • set all the trailing jamo to be equal to leading jamo.

  • set all the complex jamo to be equal to a series of leading jamo

  • (This may need some further tailoring).
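A minimal sketch of the effect these rules are after, written in Python rather than collation rule syntax. It is not the CLDR tailoring itself: it approximates "equal at the primary level" by building a normalized search key. The names `SEARCH_TABLE` and `search_key` are made up for illustration, and the sketch assumes only modern jamo need to be covered.

```python
import unicodedata

# Assumption: modern jamo only; old jamo would need additional entries.
# Double (SSANG) leading consonants -> a series of simple leading consonants.
_LEADING_DOUBLES = {
    0x1101: "\u1100\u1100",  # SSANGKIYEOK -> KIYEOK KIYEOK
    0x1104: "\u1103\u1103",  # SSANGTIKEUT
    0x1108: "\u1107\u1107",  # SSANGPIEUP
    0x110A: "\u1109\u1109",  # SSANGSIOS
    0x110D: "\u110C\u110C",  # SSANGCIEUC
}

# Trailing consonants U+11A8..U+11C2 -> simple leading consonants.
_TRAILING = {
    0x11A8: "\u1100",        0x11A9: "\u1100\u1100",  0x11AA: "\u1100\u1109",
    0x11AB: "\u1102",        0x11AC: "\u1102\u110C",  0x11AD: "\u1102\u1112",
    0x11AE: "\u1103",        0x11AF: "\u1105",        0x11B0: "\u1105\u1100",
    0x11B1: "\u1105\u1106",  0x11B2: "\u1105\u1107",  0x11B3: "\u1105\u1109",
    0x11B4: "\u1105\u1110",  0x11B5: "\u1105\u1111",  0x11B6: "\u1105\u1112",
    0x11B7: "\u1106",        0x11B8: "\u1107",        0x11B9: "\u1107\u1109",
    0x11BA: "\u1109",        0x11BB: "\u1109\u1109",  0x11BC: "\u110B",
    0x11BD: "\u110C",        0x11BE: "\u110E",        0x11BF: "\u110F",
    0x11C0: "\u1110",        0x11C1: "\u1111",        0x11C2: "\u1112",
}

SEARCH_TABLE = {**_LEADING_DOUBLES, **_TRAILING}


def search_key(text: str) -> str:
    """Decompose Hangul syllables to jamo, then unify trailing and complex jamo."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.translate(SEARCH_TABLE)


# U+AC01 (GAG) typed so far is a prefix of U+AC00 U+AE30 (GA GI) under this key.
print(search_key("\uAC00\uAE30").startswith(search_key("\uAC01")))  # True
```

With a key like this, a partially typed syllable whose final consonant will become the next syllable's initial still matches as a prefix, which is the kind of case Find in page and autocompletion run into.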

xpath

None

locale

None

Activity

TracBot
May 10, 2019, 7:10 AM
Trac Comment 21 by —2014-04-22T20:37:42.506Z

Milestone 1.9m2 deleted

TracBot
May 10, 2019, 7:10 AM
Trac Comment 20 by verdy_p@abeef3a88dc95339—2011-09-09T22:48:09.000Z

Note that the OpenType specification currently FORBIDS these decompositions/recompositions as a validity condition for substitution rules (see Appendix B for the list of "old jamos" for which it is acceptable to define "ligatures" in the character layout). This means that, for example, <KIEOK,KIEOK> is expected not to be rendered as <SSANGKIEOK>.

A string that uses only simple jamos (where leading and trailing consonants are not differentiated), as in the old KSX standard (and the obsolete Unicode 1.0 encoding of these jamos), may make it difficult to find exactly where the syllable breaks fall in a succession of "simple" consonant jamos (a difficulty similar to what occurs in the Latin alphabet). Rules probably exist to infer where the syllable break occurs (SSANG consonants are easy to detect), but there are probably exceptions. A dictionary lookup may expose those exceptions by inserting a zero-width space or word joiner in the middle, or at the beginning or end, of the succession of consonants (is there another separator used, such as SHY?), so that consonant clusters can be reliably reconstructed using Hangul wherever possible.

I think that such reconstruction may be useful for handling SMS messages on GSM phones that use a simplified input method. Algorithms certainly exist to perform such a transform, either directly in an IME on input, or on reception to improve the presentation of long sequences of linear jamos as square Hangul syllables, or to perform the reverse transform from squares to linear simple jamos for accessibility reasons. But depending on the implementation (or on possibly long lists of exceptions to the effective syllable-breaking rules), we may see various forms of strings transmitted on the network or in documents (which Korean users still have no difficulty reading and interpreting, even if this is not the "preferred" layout for modern Hangul). Historical documents (e.g. produced on mechanical typewriters) will also likely not use the modern Hangul square composition rules (with syllable breaks made visible by the layout in squares), and it is even possible that, to avoid some ambiguities, regular SPACE or HYPHEN separators were used to separate the syllables explicitly, whereas modern Hangul does not use any space or punctuation between syllables (and often not between words either).

TracBot
May 10, 2019, 7:10 AM
Trac Comment 19 by verdy_p@abeef3a88dc95339—2011-09-09T22:21:45.000Z

Complex jamo decompositions may still be given collation weight differences at different levels. For example, complex jamos used in modern Hangul (that are part of encoded Hangul syllables) and available on all keyboards (e.g. SSANG consonants) would probably be differentiated at a lower level, say level 2, even if they are still unified at level 1.

  • level 1 would differentiate only the simple non-decomposable jamos, without distinguishing leading from trailing consonants.

  • level 2 would differentiate simple jamos from the precomposed jamos used in modern syllables (notably SSANG consonants).

  • level 3 would differentiate leading and trailing jamo consonants (which are unified at level 1).

  • other complex jamos (old jamos only) would be given differences at level 4 (because not all keyboards can type them consistently, they may occur decomposed or precomposed); this concerns all the remaining jamos that are currently distinguished only at level 1 in the DUCET.

  • only the last level would exhibit all differences between old jamos, fully decomposed using expansions with non-zero weights on all levels for all trailing "simple" jamos.

This way we can still have a single collation table which would remain usable for searching with the appropriate matching (loosest matching at level 1, strictest matching at the highest level, which would be equivalent to the existing primary differences exhibited in the DUCET or CLDR-root's modified DUCET, the Google matching being at an intermediate level).
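As a rough illustration of the first three levels proposed above, here is a sketch in Python. It is not how the DUCET or CLDR assigns weights: the table is deliberately reduced to the KIYEOK family, levels 4 and 5 (old jamos) are left out, and `EXPANSIONS`, `level_keys` and `matches` are hypothetical names.

```python
import unicodedata

# Reduced, illustrative table (assumption): KIYEOK family only.
EXPANSIONS = {
    "\u1101": "\u1100\u1100",  # leading SSANGKIYEOK -> KIYEOK KIYEOK
    "\u11A8": "\u1100",        # trailing KIYEOK -> leading KIYEOK
    "\u11A9": "\u1100\u1100",  # trailing SSANGKIYEOK -> KIYEOK KIYEOK
}


def level_keys(text):
    """Return (level 1, level 2, level 3) keys for the text."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        units = EXPANSIONS.get(ch, ch)
        is_complex = len(units) > 1               # precomposed vs. simple jamo
        is_trailing = 0x11A8 <= ord(ch) <= 0x11C2  # modern trailing consonant
        for unit in units:
            primary.append(unit)                  # level 1: simple jamo only
            secondary.append(int(is_complex))     # level 2: SSANG vs. sequence
            tertiary.append(int(is_trailing))     # level 3: leading vs. trailing
    return tuple(primary), tuple(secondary), tuple(tertiary)


def matches(a, b, strength):
    """Compare two strings using only the first `strength` levels."""
    return level_keys(a)[:strength] == level_keys(b)[:strength]


print(matches("\u1101", "\u1100\u1100", 1))  # True: unified at level 1
print(matches("\u1101", "\u1100\u1100", 2))  # False: differs at level 2
print(matches("\u11A8", "\u1100", 2))        # True: same through level 2
print(matches("\u11A8", "\u1100", 3))        # False: differs at level 3
```

Choosing `strength` then plays the role of the collator strength setting: 1 for the loosest search matching, higher values for progressively stricter matching from the same table.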

TracBot
May 10, 2019, 7:10 AM
Trac Comment 18 by —2010-10-28T00:24:12.000Z

With the followup changes in , Jungshik says this is OK.

TracBot
May 10, 2019, 7:10 AM
Trac Comment 17 by —2010-10-14T18:25:58.000Z

Hmm.. could you tell me the rationale for moving that to the Korean locale? That means, Korean search (as in Find-in-page) only works correctly in the Korean locale, which is not what most customers of CLDR/ICU expect.

The Korean script is used only by the Korean language, and I don't see why it cannot be in Root (data size is too large? I guess not). OK. I'll comment more on .

Priority

medium

Assignee

Peter Edberg

Reporter

TracBot

Reviewer

Jungshik Shin

Labels