Some words in Telugu are not processed correctly by ICU

Description

I tested ICU-C 3.6 and 3.8.d02, and neither version returns correct glyph indexes for some Telugu text. To test this I used the "Sample/layout" program that is delivered with the ICU-C source code. For comparison, OpenOffice also uses ICU-C 3.6, yet the same text is rendered correctly there, so I suppose OpenOffice carries some patches for this.
I used fonts such as:

  • Gautami

  • TLOT-Hemalatha Normal

  • TLOT-Hemalatha Italic

  • TLOT-Hemalatha Bold

  • TLOT-Hemalatha Bold Italic

  • and many others

For all of these fonts ICU returns an incorrect index for the last glyph. The sequence of Unicode code points that produces the wrong result is:
U+0C2A;U+0C4D;U+0C30;U+0C15;U+0C3E;U+0C37;U+0C4D;

According to the Unicode standard, these are:
0C2A;TELUGU LETTER PA;Lo;0;L;;;;;N;;;;;
0C4D;TELUGU SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0C30;TELUGU LETTER RA;Lo;0;L;;;;;N;;;;;
0C15;TELUGU LETTER KA;Lo;0;L;;;;;N;;;;;
0C3E;TELUGU VOWEL SIGN AA;Mn;0;NSM;;;;;N;;;;;
0C37;TELUGU LETTER SSA;Lo;0;L;;;;;N;;;;;
0C4D;TELUGU SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;

I attached:

  • FontMap.GDI, Sample.txt - input for "Sample/layout" program

  • LayoutSample.PNG - final text rendering by using "Sample/layout" program (incorrect)

  • OpenOffice.PNG - final text rendering by using OpenOffice (correct)

Activity

TracBot
July 1, 2018, 12:11 AM
Trac Comment 4 by eric—2007-10-04T18:24:12.000Z

KA + VIRAMA + SSA is an akhand ligature. This sequence gets reordered to KA + SSA + VIRAMA. An input sequence of KA + SSA + VIRAMA is two syllables, but all three glyphs get tagged with the 'AKHN' feature, so the akhand ligature will form, even though the glyphs aren't all in the same syllable. (This is a case where UniScribe's approach of processing one syllable at a time will do the right thing.)

The input sequence in the bug has an AA matra after the KA. The ligature still forms because the Ligature Substitution subtable in the fonts ignores all marks except for VIRAMA. A case could be made that it shouldn't, which would fix this particular case, but the same input sequence without the matra would still fail.

I'm not sure how to fix this. My best guess is to not apply features like 'AKHN' to syllables that are too short to match. (i.e. an akhand ligature will be at least three glyphs long)
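The heuristic suggested in the previous paragraph could be sketched roughly as follows. This is not ICU code: `classify`, `syllableLengths` and `akhnEligible` are hypothetical names, the character classes are heavily simplified, and the sketch works on code points in logical order rather than on reordered glyphs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified character classes; a real Indic shaper distinguishes many more.
enum class CharClass { Consonant, Virama, Matra, Other };

// Assumption: coarse Telugu ranges, good enough for the sequence in this report.
CharClass classify(char32_t c) {
    if (c == 0x0C4D) return CharClass::Virama;                    // TELUGU SIGN VIRAMA
    if (c >= 0x0C15 && c <= 0x0C39) return CharClass::Consonant;  // KA..HA
    if (c >= 0x0C3E && c <= 0x0C4C) return CharClass::Matra;      // vowel signs
    return CharClass::Other;
}

// Returns the length (in code points) of each syllable, in text order.
// A new syllable starts at a consonant that does not follow a virama.
std::vector<std::size_t> syllableLengths(const std::u32string& text) {
    std::vector<std::size_t> lengths;
    CharClass prev = CharClass::Other;
    for (char32_t c : text) {
        CharClass cls = classify(c);
        bool startsNew = lengths.empty()
            || cls == CharClass::Other
            || prev == CharClass::Other
            || (cls == CharClass::Consonant && prev != CharClass::Virama);
        if (startsNew) lengths.push_back(0);
        ++lengths.back();
        prev = cls;
    }
    return lengths;
}

// One flag per code point: true if the 'AKHN' feature would be applied,
// i.e. the enclosing syllable is at least three code points long.
std::vector<bool> akhnEligible(const std::u32string& text) {
    std::vector<bool> eligible;
    for (std::size_t len : syllableLengths(text)) {
        eligible.insert(eligible.end(), len, len >= 3);
    }
    assert(eligible.size() == text.size());
    return eligible;
}
```

Under these simplified rules the sequence from the report splits into syllables of length 3, 2 and 2, so only PA + VIRAMA + RA would be tagged and KA + AA + SSA + VIRAMA could not form the akhand ligature.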

TracBot
July 1, 2018, 12:11 AM
Trac Comment 5 by eric—2007-10-05T01:50:04.000Z

describes the same problem. It also seems that this problem can occur across the boundary of two "long" syllables, so syllable length cannot be used to solve it. Perhaps we could encode a syllable number with each glyph and restrict all lookups to the same syllable. (Should contextual lookups be restricted to a single syllable?) We could use just the low-order bit of the syllable number, perhaps stealing a bit from the feature flags.
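A minimal sketch of the syllable-bit idea, under the same kind of simplifying assumptions: `classify`, `syllableBits` and `withinOneSyllable` are hypothetical names, not ICU functions, and the sketch operates on code points in logical order instead of on the glyph array and feature flags that the layout engine actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified character classes; a real Indic shaper distinguishes many more.
enum class CharClass { Consonant, Virama, Matra, Other };

// Assumption: coarse Telugu ranges, good enough for the sequence in this report.
CharClass classify(char32_t c) {
    if (c == 0x0C4D) return CharClass::Virama;                    // TELUGU SIGN VIRAMA
    if (c >= 0x0C15 && c <= 0x0C39) return CharClass::Consonant;  // KA..HA
    if (c >= 0x0C3E && c <= 0x0C4C) return CharClass::Matra;      // vowel signs
    return CharClass::Other;
}

// Returns, for each code point, the low-order bit of its syllable number.
// Adjacent syllables therefore always carry different bits.
std::vector<std::uint8_t> syllableBits(const std::u32string& text) {
    std::vector<std::uint8_t> bits;
    std::uint8_t bit = 1;  // flipped to 0 when the first syllable starts
    CharClass prev = CharClass::Other;
    for (char32_t c : text) {
        CharClass cls = classify(c);
        bool startsNew = bits.empty()
            || cls == CharClass::Other
            || prev == CharClass::Other
            || (cls == CharClass::Consonant && prev != CharClass::Virama);
        if (startsNew) bit ^= 1;
        bits.push_back(bit);
        prev = cls;
    }
    return bits;
}

// True if positions first..last (inclusive) all lie in one syllable.
// Every position in the range is checked, including ones a lookup would
// skip, because a single bit cannot tell syllable n from syllable n + 2.
bool withinOneSyllable(const std::vector<std::uint8_t>& bits,
                       std::size_t first, std::size_t last) {
    assert(first <= last && last < bits.size());
    for (std::size_t i = first + 1; i <= last; ++i) {
        if (bits[i] != bits[first]) return false;
    }
    return true;
}
```

A lookup would then accept a match only when `withinOneSyllable` holds for the matched range. For the sequence in this report, the range KA + AA + SSA + VIRAMA crosses a syllable boundary and would be rejected.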

TracBot
July 1, 2018, 12:11 AM
Trac Comment 8 by —2008-05-21T17:55:21.000Z

Please remove the commented-out code in OTLE.cpp at lines 336 and 348.
I didn't see tests for this. Are there any?

TracBot
July 1, 2018, 12:11 AM
Trac Comment 10 by sylwekbala@10625fd9bde9dccd—2008-06-09T13:28:24.000Z

I tested it on ICU 4.0.d02, and some other sequences still don't work properly, for example:
ప్రా
గ్రా

Fixed

Assignee

TracBot

Reporter

TracBot

Components

Labels

None

Reviewer

None

Priority

major

Time Needed

Days

Fix versions