RFE: Add script support (UTR 24)

Description

Proposal: Unicode Script Support in ICU4C/ICU4J

CURRENT STATUS

ICU4C and ICU4J will be updated to reflect UTR #24: Script Names
(based on ISO 15924:2000, "Code for the representation of names of
scripts"). UTR #24 describes the basis for a new Unicode data file,
Scripts.txt.

ICU4C currently has a C implementation of something that appears to be
script data, which is wrapped in a C++ API. The C implementation
looks like this:

enum UCharScript {
    U_BASIC_LATIN,
    U_LATIN_1_SUPPLEMENT,
    /* ... */
    U_CHAR_SCRIPT_COUNT,
    U_NO_SCRIPT=U_CHAR_SCRIPT_COUNT
};
typedef enum UCharScript UCharScript;

UCharScript u_charScript(UChar32 ch);

And in the C++ Unicode class:

static inline EUnicodeScript getScript(UChar32 ch);

where EUnicodeScript is an enum that corresponds to UCharScript.

Although these APIs use the word "script", they really describe blocks!
These entities will be renamed UCharBlock, U_CHAR_BLOCK_COUNT,
getBlock(), etc. We can do this through redefinition/aliasing and
deprecation, following standard ICU practices.
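
As a rough illustration of the redefinition/aliasing approach, the sketch below keeps the old names compiling while the block-based names take over. It is only a sketch under stated assumptions: UChar32 is a local stand-in typedef, the enum lists just the two blocks shown above, U_NO_BLOCK and u_charBlock() are hypothetical names that the proposal does not spell out, and the lookup is a toy covering U+0000..U+00FF.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t UChar32;  /* stand-in for ICU's UChar32 */

/* New, accurately named block enum (only two blocks shown). */
typedef enum UCharBlock {
    U_BASIC_LATIN,
    U_LATIN_1_SUPPLEMENT,
    U_CHAR_BLOCK_COUNT,
    U_NO_BLOCK = U_CHAR_BLOCK_COUNT   /* hypothetical name */
} UCharBlock;

/* Deprecated aliases so that existing callers keep compiling. */
typedef UCharBlock UCharScript;
#define U_CHAR_SCRIPT_COUNT U_CHAR_BLOCK_COUNT
#define U_NO_SCRIPT         U_NO_BLOCK

/* Toy lookup: hypothetical new name, covers U+0000..U+00FF only. */
static UCharBlock u_charBlock(UChar32 ch) {
    assert(ch >= 0);
    if (ch <= 0x7F) return U_BASIC_LATIN;
    if (ch <= 0xFF) return U_LATIN_1_SUPPLEMENT;
    return U_NO_BLOCK;
}

/* Deprecated name simply forwards to the new function. */
static UCharScript u_charScript(UChar32 ch) {
    return u_charBlock(ch);
}
```

With this arrangement, old code calling u_charScript() and comparing against U_NO_SCRIPT behaves exactly like new code calling u_charBlock() and comparing against U_NO_BLOCK.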

ENUM

We will then introduce an enum that reflects the entities defined in
UTR #24. (The numbering will be tool-generated and may be different
from what is listed here.)

enum UScriptCode {
U_COMMON = -1, /* Zyyy */
U_INHERITED = 0, /* Qaai */
U_ARABIC = 1, /* Arab */
U_ARMENIAN = 2, /* Armn */
U_BENGALI = 3, /* Beng */
U_BOPOMOFO = 4, /* Bopo */
U_CHEROKEE = 5, /* Cher */
U_COPTIC = 6, /* Qaac */
U_CYRILLIC = 7, /* Cyrl (Cyrs) */
U_DESERET = 8, /* Dsrt */
U_DEVANAGARI = 9, /* Deva */
U_ETHIOPIC = 10, /* Ethi */
U_GEORGIAN = 11, /* Geor (Geon, Geoa) */
U_GOTHIC = 12, /* Goth */
U_GREEK = 13, /* Grek */
U_GUJARATI = 14, /* Gujr */
U_GURMUKHI = 15, /* Guru */
U_HAN = 16, /* Hani */
U_HANGUL = 17, /* Hang */
U_HEBREW = 18, /* Hebr */
U_HIRAGANA = 19, /* Hira */
U_KANNADA = 20, /* Knda */
U_KATAKANA = 21, /* Kana */
U_KHMER = 22, /* Khmr */
U_LAO = 23, /* Laoo */
U_LATIN = 24, /* Latn (Latf, Latg) */
U_MALAYALAM = 25, /* Mlym */
U_MONGOLIAN = 26, /* Mong */
U_MYANMAR = 27, /* Mymr */
U_OGHAM = 28, /* Ogam */
U_OLD_ITALIC = 29, /* Ital */
U_ORIYA = 30, /* Orya */
U_RUNIC = 31, /* Runr */
U_SINHALA = 32, /* Sinh */
U_SYRIAC = 33, /* Syrc (Syrj, Syrn, Syre) */
U_TAMIL = 34, /* Taml */
U_TELUGU = 35, /* Telu */
U_THAANA = 36, /* Thaa */
U_THAI = 37, /* Thai */
U_TIBETAN = 38, /* Tibt */
U_UCAS = 39, /* Cans */
U_YI = 40, /* Yiii */
};

BASIC API

A C API will make the data from Scripts.txt available:

/* Return U_MALAYALAM given 0x0D02 */
UScriptCode uchar_getScript(UChar32 codePoint);

/* Return "Malayalam" given U_MALAYALAM */
const char* uchar_getScriptName(UScriptCode scriptCode);

/* Return "Mlym" given U_MALAYALAM */
const char* uchar_getScriptAbbr(UScriptCode scriptCode);

/* Return U_MALAYALAM given "Malayalam" OR "Mlym" */
UScriptCode uchar_getScriptCode(const char* nameOrAbbr);

/* Return set of characters in a script */
void uchar_getCharsInScript(UScriptCode scriptCode, UUnicodeSet *set);

In the last call, UUnicodeSet is the C object (not yet implemented)
corresponding to the C++ UnicodeSet class. It will be implemented
with open/close semantics, and eventually the implementation will be
moved from C++ to C.

The Java API will be analogous. A C++ API may be unnecessary, except
for the last function, which we will probably want to supply using C++
UnicodeSet objects:

/* Return set of characters in a script */
void Unicode::getCharsInScript(UScriptCode scriptCode, UnicodeSet& set);
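
To make the name and abbreviation lookups in this section concrete, here is a minimal sketch backed by a hand-written four-entry table. Several things are assumptions rather than part of the proposal: the ScriptEntry struct and kScripts table are hypothetical, returning NULL for an unknown code is a guess, matching is case-sensitive, and the reverse lookup uses a hypothetical helper sketch_getScriptCode() with a success flag because the proposal does not say how uchar_getScriptCode() reports an unknown name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Subset of the proposed enum, using the illustrative numbering. */
typedef enum UScriptCode {
    U_COMMON = -1,
    U_INHERITED = 0,
    U_LATIN = 24,
    U_MALAYALAM = 25
} UScriptCode;

/* Hypothetical table row: code, long name, four-letter abbreviation. */
typedef struct {
    UScriptCode code;
    const char *name;
    const char *abbr;
} ScriptEntry;

static const ScriptEntry kScripts[] = {
    { U_COMMON,    "Common",    "Zyyy" },
    { U_INHERITED, "Inherited", "Qaai" },
    { U_LATIN,     "Latin",     "Latn" },
    { U_MALAYALAM, "Malayalam", "Mlym" }
};
static const size_t kScriptCount = sizeof kScripts / sizeof kScripts[0];

static const ScriptEntry *findByCode(UScriptCode code) {
    for (size_t i = 0; i < kScriptCount; ++i) {
        if (kScripts[i].code == code) return &kScripts[i];
    }
    return NULL;
}

/* Returns NULL for a code that is not in the table (assumption). */
const char *uchar_getScriptName(UScriptCode scriptCode) {
    const ScriptEntry *e = findByCode(scriptCode);
    return e ? e->name : NULL;
}

const char *uchar_getScriptAbbr(UScriptCode scriptCode) {
    const ScriptEntry *e = findByCode(scriptCode);
    return e ? e->abbr : NULL;
}

/* Hypothetical helper: returns 1 and fills *out on a match, else 0. */
int sketch_getScriptCode(const char *nameOrAbbr, UScriptCode *out) {
    assert(nameOrAbbr != NULL && out != NULL);
    for (size_t i = 0; i < kScriptCount; ++i) {
        if (strcmp(nameOrAbbr, kScripts[i].name) == 0 ||
            strcmp(nameOrAbbr, kScripts[i].abbr) == 0) {
            *out = kScripts[i].code;
            return 1;
        }
    }
    return 0;
}
```

A linear scan is enough for a table of about forty scripts; a generated table indexed directly by code would avoid even that.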

SCRIPT RUNS

Finally, there will be an API to analyze script runs. It will resolve
Common and Inherited characters according to their neighbors and
handle opening and closing punctuation pairs such as '(' and ')'.
The C API for this will be as follows:

/**
 * Given start and end indices into a piece of text and an index to
 * a specific character, return the script run containing that
 * character. The run is specified via fill-in parameters for the
 * start and end indices of the run and through the function return
 * value, which is the script code. To obtain previous or next
 * runs, move the position index to the character before or after the
 * returned run.
 * @param text text to analyze
 * @param start the start of the text to be analyzed, inclusive
 * index
 * @param limit the end of the text to be analyzed, exclusive index
 * @param pos index of a character in the run to be returned
 * @param runStart fill-in parameter to receive the script run start
 * index, inclusive
 * @param runLimit fill-in parameter to receive the script run end
 * index, exclusive
 * @param localeID the locale to use to determine matching
 * punctuation
 * @return the script code of text[*runStart..*runLimit-1]
 */
UScriptCode uchar_getScriptRun(const UChar* text,
                               int32_t start,
                               int32_t limit,
                               int32_t pos,
                               int32_t* runStart,
                               int32_t* runLimit,
                               const char* localeID);
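
The following is a minimal sketch of how Common and Inherited characters could be folded into neighboring runs. It is not the planned implementation: toyScriptOf() is a hypothetical classifier covering only ASCII letters, a range of Greek letters, and combining marks; sketch_getScriptRun() is a hypothetical name that omits the localeID parameter and all paired-punctuation handling; and each UChar is treated as a whole code point, so surrogate pairs are ignored.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t UChar;   /* stand-in for ICU's UTF-16 code unit */

typedef enum UScriptCode {
    U_COMMON = -1, U_INHERITED = 0, U_GREEK = 13, U_LATIN = 24
} UScriptCode;

/* Hypothetical toy classifier; real data would come from Scripts.txt. */
static UScriptCode toyScriptOf(UChar c) {
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) return U_LATIN;
    if (c >= 0x0391 && c <= 0x03C9) return U_GREEK;      /* roughly Alpha..omega */
    if (c >= 0x0300 && c <= 0x036F) return U_INHERITED;  /* combining marks */
    return U_COMMON;
}

static int isSpecific(UScriptCode s) {
    return s != U_COMMON && s != U_INHERITED;
}

/* Scan runs from 'start' and return the one that contains 'pos'. */
UScriptCode sketch_getScriptRun(const UChar *text, int32_t start, int32_t limit,
                                int32_t pos, int32_t *runStart, int32_t *runLimit) {
    assert(text != 0 && runStart != 0 && runLimit != 0);
    assert(start <= pos && pos < limit);
    int32_t i = start;
    for (;;) {
        int32_t begin = i;
        UScriptCode run = U_COMMON;
        for (; i < limit; ++i) {
            UScriptCode s = toyScriptOf(text[i]);
            if (!isSpecific(s)) continue;   /* Common/Inherited joins the current run */
            if (!isSpecific(run)) run = s;  /* first specific script names the run */
            else if (s != run) break;       /* a different script starts a new run */
        }
        if (pos < i) {
            *runStart = begin;
            *runLimit = i;
            return run;
        }
    }
}
```

In this sketch, leading Common characters attach to the script that follows them, trailing ones attach to the script that precedes them, and text with no specific script at all comes back as a single U_COMMON run. Scanning from start on every call is linear in the text length, which is acceptable for illustration only.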

Because matching punctuation is locale-dependent (for quotes), we must
pass in a locale. Alternatively, we can have the user supply a
callback function that tells the run analyzer which punctuation
characters match:

/**
 * Given a character, return its matching partner, or U+0000
 * if none. E.g., '(' matches ')' and '[' matches ']'.
 */
typedef UChar32 (*U_PUNCTUATION_MATCH_FN)(UChar32);

/**
 * Given start and end indices into a piece of text and an index to
 * a specific character, return the script run containing that
 * character. The run is specified via fill-in parameters for the
 * start and end indices of the run and through the function return
 * value, which is the script code. To obtain previous or next
 * runs, move the position index to the character before or after the
 * returned run.
 * @param text text to analyze
 * @param start the start of the text to be analyzed, inclusive
 * index
 * @param limit the end of the text to be analyzed, exclusive index
 * @param pos index of a character in the run to be returned
 * @param runStart fill-in parameter to receive the script run start
 * index, inclusive
 * @param runLimit fill-in parameter to receive the script run end
 * index, exclusive
 * @param matchFn callback function that returns a punctuation
 * character's matching partner, e.g., '(' and ')'.
 * @return the script code of text[*runStart..*runLimit-1]
 */
UScriptCode uchar_getScriptRun(const UChar* text,
                               int32_t start,
                               int32_t limit,
                               int32_t pos,
                               int32_t* runStart,
                               int32_t* runLimit,
                               U_PUNCTUATION_MATCH_FN matchFn);
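
A caller-supplied callback of this type might look like the sketch below, together with one way a run analyzer could use it to locate the partner of an opening character. The pair table, toyMatch(), and findPartner() are hypothetical and not part of the proposal; the text is held as UChar32 code points to keep the example short, and UChar32 is a local stand-in typedef.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t UChar32;  /* stand-in for ICU's UChar32 */
typedef UChar32 (*U_PUNCTUATION_MATCH_FN)(UChar32);

/* Hypothetical pair table; a locale-aware callback would pick its own quotes. */
static const UChar32 kPairs[][2] = {
    { '(', ')' }, { '[', ']' }, { '{', '}' }, { 0x201C, 0x201D }
};

/* Return the matching partner of c, or U+0000 if it has none. */
UChar32 toyMatch(UChar32 c) {
    for (size_t i = 0; i < sizeof kPairs / sizeof kPairs[0]; ++i) {
        if (c == kPairs[i][0]) return kPairs[i][1];
        if (c == kPairs[i][1]) return kPairs[i][0];
    }
    return 0;
}

/* Find the index of the partner of the opening character at text[open],
   honoring nesting; return -1 if there is no partner before limit. */
int32_t findPartner(const UChar32 *text, int32_t limit, int32_t open,
                    U_PUNCTUATION_MATCH_FN matchFn) {
    assert(text != NULL && matchFn != NULL);
    assert(open >= 0 && open < limit);
    UChar32 opener = text[open];
    UChar32 partner = matchFn(opener);
    if (partner == 0) return -1;
    int32_t depth = 0;
    for (int32_t i = open + 1; i < limit; ++i) {
        if (text[i] == partner) {
            if (depth == 0) return i;
            --depth;
        } else if (text[i] == opener) {
            ++depth;    /* nested opener of the same kind */
        }
    }
    return -1;
}
```

Because the callback is symmetric, it cannot say which member of a pair is the opener; findPartner() assumes the caller already knows that text[open] opens a pair.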

IMPLEMENTATION NOTES

Eric has already implemented most of this code, although not with this
API. The smart script run code has not been implemented yet. The
final implementation will consist of a Java tool (derived from Eric's
tool) that reads Scripts.txt and generates data tables, probably in
the form of C/Java source files. The Java tool will also generate
UnicodeString patterns for each script so these do not have to be
computed at run time. Relatively small functions (except for the
script run function) will access the data tables and implement the
API. The API will be public in C, C++, and Java.
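
As an illustration of a small function reading a generated data table, the sketch below does a binary search over sorted code point ranges. The ScriptRange layout, the kRanges excerpt, and the name sketch_getScript() are hypothetical; the rows are written by hand for illustration rather than generated from Scripts.txt, and treating every unlisted code point as U_COMMON is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t UChar32;  /* stand-in for ICU's UChar32 */

typedef enum UScriptCode {
    U_COMMON = -1, U_GREEK = 13, U_LATIN = 24, U_MALAYALAM = 25
} UScriptCode;

/* Hypothetical row format: inclusive code point range plus its script. */
typedef struct {
    UChar32 first;
    UChar32 last;
    UScriptCode script;
} ScriptRange;

/* Hand-written excerpt; must stay sorted and non-overlapping. */
static const ScriptRange kRanges[] = {
    { 0x0041, 0x005A, U_LATIN },
    { 0x0061, 0x007A, U_LATIN },
    { 0x03B1, 0x03C9, U_GREEK },
    { 0x0D02, 0x0D03, U_MALAYALAM }
};

/* Binary search for the range containing cp; unlisted means U_COMMON here. */
UScriptCode sketch_getScript(UChar32 cp) {
    assert(cp >= 0 && cp <= 0x10FFFF);
    size_t lo = 0;
    size_t hi = sizeof kRanges / sizeof kRanges[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < kRanges[mid].first) {
            hi = mid;
        } else if (cp > kRanges[mid].last) {
            lo = mid + 1;
        } else {
            return kRanges[mid].script;
        }
    }
    return U_COMMON;
}
```

A range table like this keeps the generated source small, at the cost of a logarithmic lookup per character.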

Environment:
Status:
Assignee: TracBot
Reporter: TracBot
Labels:
tracCreated: Mar 20, 2001, 11:20 PM
tracOwner: ram
tracReporter: alan@8d6336d19dc03735
tracResolution: fixed
tracReviewer: alan
tracStatus: closed
Components:
Priority: assess