
ICU LayoutEngine ignored ISCII syllable detection and splitting rules in ligature formation

Description

When the ICU LayoutEngine checks for a syllable boundary before reordering, it ignores one of ISCII's rules: the maximum number of VIRAMAs allowed in a syllable. For example:

'Ka + Virama + Ka + Virama + Ka + Virama + Ka + Virama + Ka' should be split in the following way:

'Ka + Virama + Ka + Virama + Ka + Virama + Ka + Virama' - First syllable
'Ka' - Second syllable

Proposed change: This can be done in the state table in IndicReordering.cpp, which could look like this:

 xx  vm  sm  iv  i2  ct  cn  nu  dv  s1  s2  s3  vr  zw
{ 1,  1,  1,  5,  8,  3,  2,  1,  1,  9,  5,  1,  1,  1}, //  0 - ground state
{-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1}, //  1 - exit state
{-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5,  4, -1}, //  2 - consonant with nukta
{-1,  6,  1, -1, -1, -1, -1,  2,  5,  9,  5,  5,  4, -1}, //  3 - consonant
{-1, -1, -1, -1, -1, 12, 11, -1, -1, -1, -1, -1, -1,  7}, //  4 - consonant virama
{-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1}, //  5 - dependent vowels
{-1, -1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1}, //  6 - vowel mark
{-1, -1, -1, -1, -1,  3,  2, -1, -1, -1, -1, -1, -1, -1}, //  7 - ZWJ, ZWNJ
{-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  4, -1}, //  8 - independent vowels that can take a virama
{-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, 10,  5, -1, -1}, //  9 - first part of split vowel
{-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, -1,  5, -1, -1}, // 10 - second part of split vowel
{-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5, 13, -1}, // 11 - ct vr ct nu
{-1,  6,  1, -1, -1, -1, -1, 11,  5,  9,  5,  5, 13, -1}, // 12 - ct vr ct
{-1, -1, -1, -1, -1, 15, 14, -1, -1, -1, -1, -1, -1,  7}, // 13 - ct vr ct vr
{-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5, 16, -1}, // 14 - ct vr ct vr ct nu
{-1,  6,  1, -1, -1, -1, -1, 14,  5,  9,  5,  5, 16, -1}, // 15 - ct vr ct vr ct
{-1, -1, -1, -1, -1, 18, 17, -1, -1, -1, -1, -1, -1,  7}, // 16 - ct vr ct vr ct vr
{-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5, 19, -1}, // 17 - ct vr ct vr ct vr ct nu
{-1,  6,  1, -1, -1, -1, -1, 17,  5,  9,  5,  5, 19, -1}, // 18 - ct vr ct vr ct vr ct
{-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  7}  // 19 - ct vr ct vr ct vr ct vr

States 11-18 are new states that allow no more than 4 VIRAMAs in a syllable; the 4th VIRAMA is explicit.
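The effect of the proposed table can be checked with a small table-driven scanner. What follows is only a minimal sketch, not the LayoutEngine's actual code: the names `CharClass`, `kStateTable`, `findSyllableEnd` and `syllableEnds` are hypothetical, and it assumes the column order shown in the table header and that a negative entry means the current character does not belong to the syllable being scanned.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical character classes, in the column order of the table header.
enum CharClass { XX, VM, SM, IV, I2, CT, CN, NU, DV, S1, S2, S3, VR, ZW, CLASS_COUNT };

// The proposed state table, copied from the report.
static const signed char kStateTable[][CLASS_COUNT] = {
    { 1,  1,  1,  5,  8,  3,  2,  1,  1,  9,  5,  1,  1,  1}, //  0 - ground state
    {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1}, //  1 - exit state
    {-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5,  4, -1}, //  2 - consonant with nukta
    {-1,  6,  1, -1, -1, -1, -1,  2,  5,  9,  5,  5,  4, -1}, //  3 - consonant
    {-1, -1, -1, -1, -1, 12, 11, -1, -1, -1, -1, -1, -1,  7}, //  4 - consonant virama
    {-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1}, //  5 - dependent vowels
    {-1, -1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1}, //  6 - vowel mark
    {-1, -1, -1, -1, -1,  3,  2, -1, -1, -1, -1, -1, -1, -1}, //  7 - ZWJ, ZWNJ
    {-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  4, -1}, //  8 - independent vowels
    {-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, 10,  5, -1, -1}, //  9 - first part of split vowel
    {-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, -1,  5, -1, -1}, // 10 - second part of split vowel
    {-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5, 13, -1}, // 11 - ct vr ct nu
    {-1,  6,  1, -1, -1, -1, -1, 11,  5,  9,  5,  5, 13, -1}, // 12 - ct vr ct
    {-1, -1, -1, -1, -1, 15, 14, -1, -1, -1, -1, -1, -1,  7}, // 13 - ct vr ct vr
    {-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5, 16, -1}, // 14 - ct vr ct vr ct nu
    {-1,  6,  1, -1, -1, -1, -1, 14,  5,  9,  5,  5, 16, -1}, // 15 - ct vr ct vr ct
    {-1, -1, -1, -1, -1, 18, 17, -1, -1, -1, -1, -1, -1,  7}, // 16 - ct vr ct vr ct vr
    {-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5, 19, -1}, // 17 - ct vr ct vr ct vr ct nu
    {-1,  6,  1, -1, -1, -1, -1, 17,  5,  9,  5,  5, 19, -1}, // 18 - ct vr ct vr ct vr ct
    {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  7}  // 19 - ct vr ct vr ct vr ct vr
};

// Returns the index one past the last character of the syllable that starts
// at `start`. Scanning stops at the first character whose transition is negative.
std::size_t findSyllableEnd(const std::vector<CharClass>& classes, std::size_t start) {
    std::size_t cursor = start;
    int state = 0;
    while (cursor < classes.size()) {
        state = kStateTable[state][classes[cursor]];
        if (state < 0) {
            break;
        }
        ++cursor;
    }
    return cursor;
}

// Splits the whole input into syllables and returns each syllable's end index.
std::vector<std::size_t> syllableEnds(const std::vector<CharClass>& classes) {
    std::vector<std::size_t> ends;
    std::size_t start = 0;
    while (start < classes.size()) {
        std::size_t end = findSyllableEnd(classes, start);
        if (end == start) {
            ++end; // defensive: guarantee progress even if a table had -1 in state 0
        }
        ends.push_back(end);
        start = end;
    }
    return ends;
}
```

For the classes of 'Ka + Virama + Ka + Virama + Ka + Virama + Ka + Virama + Ka' (CT, VR, CT, VR, CT, VR, CT, VR, CT), `syllableEnds` returns the end indices 8 and 9: the scanner reaches state 19 after the fourth VIRAMA, and the following consonant has no transition there, so it starts a new syllable.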

There is, however, still one problem with this. Ligature formation, for some reason, does not use the syllable boundary information; it forms any ligatures it can as it traverses the input Unicode string. The following example highlights the problem:

Input: 'Pa + Virama + Ka + Virama + Ssa(0937) + Virama + Ka + Virama + Ssa + Vowel Sign Aa'

Expected syllable split:
'Pa + Virama + Ka + Virama + Ssa(0937) + Virama + Ka + Virama' - First syllable
'Ssa + Vowel Sign Aa' - Second syllable

Expected ligature result:
'Half Pa-Ligature KSsa-Ka-Explicit Virama' - First syllable
'Ssa Aa' - Second syllable

However, even after adding the new states to the state table and splitting the syllable, the ligature formation code discards the syllable information instead of using it. Ligatures are still formed as and when the characters are encountered in the string. Hence the result ends up like this:

'Half Pa-Ligature KSsa-Ligature KSsa' - First syllable
'Vowel Sign Aa' - Second syllable

The combination of these two issues is clearly a defect, or missing functionality.
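The difference between the two behaviours can be illustrated with a toy model. This is a sketch under strong assumptions, not how the LayoutEngine applies font ligature tables: characters are represented as plain strings, there is a single hypothetical rule (Ka + Virama + Ssa becomes "KSsa"), half forms are not modelled, and `formLigatures` and `formLigaturesPerSyllable` are made-up names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Toy rule (hypothetical): Ka + Virama + Ssa -> "KSsa". The rule fires only
// when all three tokens lie inside the half-open range [start, end).
Tokens formLigatures(const Tokens& in, std::size_t start, std::size_t end) {
    Tokens out;
    std::size_t i = start;
    while (i < end) {
        if (i + 3 <= end && in[i] == "Ka" && in[i + 1] == "Virama" && in[i + 2] == "Ssa") {
            out.push_back("KSsa");
            i += 3;
        } else {
            out.push_back(in[i]);
            ++i;
        }
    }
    return out;
}

// Applies the same rule one syllable at a time, so that a ligature can never
// span a syllable boundary. `syllableEnds` holds each syllable's end index.
Tokens formLigaturesPerSyllable(const Tokens& in, const std::vector<std::size_t>& syllableEnds) {
    Tokens out;
    std::size_t start = 0;
    for (std::size_t end : syllableEnds) {
        Tokens part = formLigatures(in, start, end);
        out.insert(out.end(), part.begin(), part.end());
        start = end;
    }
    return out;
}
```

For the input above, calling `formLigatures` over the whole string (range 0 to 10) forms "KSsa" twice, because the second Ka + Virama pulls in the Ssa that belongs to the next syllable. Calling `formLigaturesPerSyllable` with the syllable ends 8 and 10 forms it only once and leaves Ka + Virama at the end of the first syllable and Ssa + Aa in the second, which matches the expected split.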

I am not sure about the purpose of syllable detection in the ICU LayoutEngine. I can see that it is used in the reordering function, but I have no idea why that information is discarded during ligature formation.

Are clients of the ICU LayoutEngine expected to detect syllables themselves and feed the LayoutEngine text syllable by syllable, i.e. with the offset set to the syllable boundary?

Regards
Jasdeep

Environment

Status

Assignee

John Emmons

Reporter

TracBot

Time Needed

Days

tracCc

Myles.Benett@99c2de2ae3fc33d5,Tim.Band@99c2de2ae3fc33d5

tracCreated

Feb 09, 2007, 12:30 PM

tracOwner

emmons

tracProject

ICU4C

tracReporter

Jasdeep.Sawhney@99c2de2ae3fc33d5

tracResolution

fixed

tracReviewer

srl

tracStatus

closed

tracWeeks

0.5

Components

Fix versions

Priority

medium