I'm looking at ushape.c and ArabicShaping.java and
I find that both of them misuse U+065C-U+065F as
"internal characters" for lam-alef ligatures, both
for "shaping" and for "unshaping".
The code positions U+065C-U+065F were unallocated
up to Unicode 4.0. But allocations of characters to
all but one of these positions are in the pipeline:
  065A  ARABIC VOWEL SIGN SMALL V ABOVE
  065B  ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
  065C  ARABIC VOWEL SIGN DOT BELOW
  065D  ARABIC REVERSED DAMMA
  065E  ARABIC FATHA WITH TWO DOTS
Anyone using these (yet to be) allocated characters
(assuming that they stay at those positions) together
with ushape.c's or ArabicShaping.java's shaping will
be very unpleasantly surprised! Likewise for unshaping.
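
To make the failure concrete, here is a minimal sketch in C. It is
not the actual ushape.c code: the function names are made up, and it
simply assumes that some late pass turns the four placeholder code
points into the isolated lam-alef ligatures.

```c
#include <stddef.h>
#include <stdint.h>

/* Code points that, per the report above, are reused as internal
 * lam-alef placeholders. */
#define PLACEHOLDER_FIRST 0x065C
#define PLACEHOLDER_LAST  0x065F

/* Hypothetical final pass (an assumption, not ICU's code): replace each
 * placeholder with the isolated lam-alef ligature it stands for:
 * U+FEF5 (madda above), U+FEF7 (hamza above), U+FEF9 (hamza below),
 * U+FEFB (plain alef). */
void resolve_placeholders(uint16_t *text, size_t length) {
    static const uint16_t ligature[4] = { 0xFEF5, 0xFEF7, 0xFEF9, 0xFEFB };
    for (size_t i = 0; i < length; ++i) {
        if (text[i] >= PLACEHOLDER_FIRST && text[i] <= PLACEHOLDER_LAST) {
            text[i] = ligature[text[i] - PLACEHOLDER_FIRST];
        }
    }
}

/* Runs the pass over BEH followed by U+065D (ARABIC REVERSED DAMMA in
 * the pending allocation) and returns what the combining mark became. */
uint16_t demo_collision(void) {
    uint16_t text[2] = { 0x0628, 0x065D };
    resolve_placeholders(text, 2);
    return text[1];
}
```

In this sketch the pass cannot tell a placeholder it wrote itself from
a U+065D that arrived in the input, so the combining mark after BEH
comes back as the ligature U+FEF7.
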
I realise that ushape.c and ArabicShaping.java are
essentially only for legacy use, but getting a lam-alef
in place of the abovementioned characters is too harsh a
penalty for using "new" characters with these routines.
The easy patch is to use four BMP non-characters. But
then how do we know that those aren't used internally
elsewhere in the system, and are given in the input to
I also see that 0xFFFF (with the misleading comment that
this code would be in the PUA) is used for the shaper's
internal purposes; the above question applies to it as well.
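
One way to approach that question, sketched in C under assumptions:
scan the input before shaping and see whether it already contains any
code unit the shaper would like to reserve. The function names are
hypothetical and nothing here is taken from ushape.c. The two range
checks also show why the PUA comment is misleading: 0xFFFF is a
noncharacter and lies outside the BMP Private Use Area.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* BMP noncharacters: U+FDD0..U+FDEF plus U+FFFE and U+FFFF. */
bool is_bmp_noncharacter(uint16_t c) {
    return (c >= 0xFDD0 && c <= 0xFDEF) || c == 0xFFFE || c == 0xFFFF;
}

/* BMP Private Use Area: U+E000..U+F8FF. */
bool is_bmp_private_use(uint16_t c) {
    return c >= 0xE000 && c <= 0xF8FF;
}

/* Hypothetical pre-scan (an assumption for illustration): returns true
 * if the caller's text already contains a BMP noncharacter, i.e. a code
 * unit that a shaper using noncharacters as sentinels would misread. */
bool input_uses_reserved_code(const uint16_t *text, size_t length) {
    for (size_t i = 0; i < length; ++i) {
        if (is_bmp_noncharacter(text[i])) {
            return true;
        }
    }
    return false;
}
```

A caller could refuse to shape, or fall back to a side buffer, when
input_uses_reserved_code() returns true. That only covers what comes
in through the input; it says nothing about other internal users of
the same code points elsewhere in the system.
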
The TASHKEEL option seems strange. Combining characters
don't have "isolated" or "medial" forms... They can be
applied to a TATWEEL, but that is something else.