rfe: Add Script check to UnicodeSet

Description

Recommend adding a script property to the Unicode class (and a corresponding C API),
with the following script codes:

public static final byte // SCRIPT CODE
COMMON_SCRIPT = 0,
LATIN_SCRIPT = 1,
GREEK_SCRIPT = 2,
CYRILLIC_SCRIPT = 3,
ARMENIAN_SCRIPT = 4,
HEBREW_SCRIPT = 5,
ARABIC_SCRIPT = 6,
SYRIAC_SCRIPT = 7,
THAANA_SCRIPT = 8,
DEVANAGARI_SCRIPT = 9,
BENGALI_SCRIPT = 10,
GURMUKHI_SCRIPT = 11,
GUJARATI_SCRIPT = 12,
ORIYA_SCRIPT = 13,
TAMIL_SCRIPT = 14,
TELUGU_SCRIPT = 15,
KANNADA_SCRIPT = 16,
MALAYALAM_SCRIPT = 17,
SINHALA_SCRIPT = 18,
THAI_SCRIPT = 19,
LAO_SCRIPT = 20,
TIBETAN_SCRIPT = 21,
MYANMAR_SCRIPT = 22,
GEORGIAN_SCRIPT = 23,
JAMO_SCRIPT = 24,
HANGUL_SCRIPT = 25,
ETHIOPIC_SCRIPT = 26,
CHEROKEE_SCRIPT = 27,
ABORIGINAL_SCRIPT = 28,
OGHAM_SCRIPT = 29,
RUNIC_SCRIPT = 30,
KHMER_SCRIPT = 31,
MONGOLIAN_SCRIPT = 32,
HIRAGANA_SCRIPT = 33,
KATAKANA_SCRIPT = 34,
BOPOMOFO_SCRIPT = 35,
HAN_SCRIPT = 36,
YI_SCRIPT = 37;

The code can be based on Unicode blocks, as follows (Java sample):

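// Non-letters are always COMMON_SCRIPT; letters take the script of their block.
public static byte getScript(char c) {
    return Character.isLetter(c) ? getScript(getBlock(c)) : COMMON_SCRIPT;
}

// Looks up the script assigned to a block code in the blockToScript table.
private static byte getScript(byte block) {
    return blockToScript[block];
}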

Here is the block to script mapping:

static final byte[] blockToScript = {
COMMON_SCRIPT, // 0, <RESERVED_BLOCK>
LATIN_SCRIPT, // 1, BASIC_LATIN
LATIN_SCRIPT, // 2, LATIN_1_SUPPLEMENT
LATIN_SCRIPT, // 3, LATIN_EXTENDED_A
LATIN_SCRIPT, // 4, LATIN_EXTENDED_B
LATIN_SCRIPT, // 5, IPA_EXTENSIONS
COMMON_SCRIPT, // 6, SPACING_MODIFIER_LETTERS
COMMON_SCRIPT, // 7, COMBINING_DIACRITICAL_MARKS
GREEK_SCRIPT, // 8, GREEK
CYRILLIC_SCRIPT, // 9, CYRILLIC
ARMENIAN_SCRIPT, // 10, ARMENIAN
HEBREW_SCRIPT, // 11, HEBREW
ARABIC_SCRIPT, // 12, ARABIC
SYRIAC_SCRIPT, // 13, SYRIAC
THAANA_SCRIPT, // 14, THAANA
DEVANAGARI_SCRIPT, // 15, DEVANAGARI
BENGALI_SCRIPT, // 16, BENGALI
GURMUKHI_SCRIPT, // 17, GURMUKHI
GUJARATI_SCRIPT, // 18, GUJARATI
ORIYA_SCRIPT, // 19, ORIYA
TAMIL_SCRIPT, // 20, TAMIL
TELUGU_SCRIPT, // 21, TELUGU
KANNADA_SCRIPT, // 22, KANNADA
MALAYALAM_SCRIPT, // 23, MALAYALAM
SINHALA_SCRIPT, // 24, SINHALA
THAI_SCRIPT, // 25, THAI
LAO_SCRIPT, // 26, LAO
TIBETAN_SCRIPT, // 27, TIBETAN
MYANMAR_SCRIPT, // 28, MYANMAR
GEORGIAN_SCRIPT, // 29, GEORGIAN
JAMO_SCRIPT, // 30, HANGUL_JAMO
ETHIOPIC_SCRIPT, // 31, ETHIOPIC
CHEROKEE_SCRIPT, // 32, CHEROKEE
ABORIGINAL_SCRIPT, // 33, UNIFIED_CANADIAN_ABORIGINAL_SYLLABICS
OGHAM_SCRIPT, // 34, OGHAM
RUNIC_SCRIPT, // 35, RUNIC
KHMER_SCRIPT, // 36, KHMER
MONGOLIAN_SCRIPT, // 37, MONGOLIAN
LATIN_SCRIPT, // 38, LATIN_EXTENDED_ADDITIONAL
GREEK_SCRIPT, // 39, GREEK_EXTENDED
COMMON_SCRIPT, // 40, GENERAL_PUNCTUATION
COMMON_SCRIPT, // 41, SUPERSCRIPTS_AND_SUBSCRIPTS
COMMON_SCRIPT, // 42, CURRENCY_SYMBOLS
COMMON_SCRIPT, // 43, COMBINING_MARKS_FOR_SYMBOLS
COMMON_SCRIPT, // 44, LETTERLIKE_SYMBOLS
COMMON_SCRIPT, // 45, NUMBER_FORMS
COMMON_SCRIPT, // 46, ARROWS
COMMON_SCRIPT, // 47, MATHEMATICAL_OPERATORS
COMMON_SCRIPT, // 48, MISCELLANEOUS_TECHNICAL
COMMON_SCRIPT, // 49, CONTROL_PICTURES
COMMON_SCRIPT, // 50, OPTICAL_CHARACTER_RECOGNITION
COMMON_SCRIPT, // 51, ENCLOSED_ALPHANUMERICS
COMMON_SCRIPT, // 52, BOX_DRAWING
COMMON_SCRIPT, // 53, BLOCK_ELEMENTS
COMMON_SCRIPT, // 54, GEOMETRIC_SHAPES
COMMON_SCRIPT, // 55, MISCELLANEOUS_SYMBOLS
COMMON_SCRIPT, // 56, DINGBATS
COMMON_SCRIPT, // 57, BRAILLE_PATTERNS
HAN_SCRIPT, // 58, CJK_RADICALS_SUPPLEMENT
HAN_SCRIPT, // 59, KANGXI_RADICALS
HAN_SCRIPT, // 60, IDEOGRAPHIC_DESCRIPTION_CHARACTERS
COMMON_SCRIPT, // 61, CJK_SYMBOLS_AND_PUNCTUATION
HIRAGANA_SCRIPT, // 62, HIRAGANA
KATAKANA_SCRIPT, // 63, KATAKANA
BOPOMOFO_SCRIPT, // 64, BOPOMOFO
JAMO_SCRIPT, // 65, HANGUL_COMPATIBILITY_JAMO
HAN_SCRIPT, // 66, KANBUN
BOPOMOFO_SCRIPT, // 67, BOPOMOFO_EXTENDED
COMMON_SCRIPT, // 68, ENCLOSED_CJK_LETTERS_AND_MONTHS
COMMON_SCRIPT, // 69, CJK_COMPATIBILITY
HAN_SCRIPT, // 70, CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
HAN_SCRIPT, // 71, CJK_UNIFIED_IDEOGRAPHS
YI_SCRIPT, // 72, YI_SYLLABLES
YI_SCRIPT, // 73, YI_RADICALS
HANGUL_SCRIPT, // 74, HANGUL_SYLLABLES
COMMON_SCRIPT, // 75, HIGH_SURROGATES
COMMON_SCRIPT, // 76, HIGH_PRIVATE_USE_SURROGATES
COMMON_SCRIPT, // 77, LOW_SURROGATES
COMMON_SCRIPT, // 78, PRIVATE_USE
HAN_SCRIPT, // 79, CJK_COMPATIBILITY_IDEOGRAPHS
COMMON_SCRIPT, // 80, ALPHABETIC_PRESENTATION_FORMS
ARABIC_SCRIPT, // 81, ARABIC_PRESENTATION_FORMS_A
COMMON_SCRIPT, // 82, COMBINING_HALF_MARKS
COMMON_SCRIPT, // 83, CJK_COMPATIBILITY_FORMS
COMMON_SCRIPT, // 84, SMALL_FORM_VARIANTS
ARABIC_SCRIPT, // 85, ARABIC_PRESENTATION_FORMS_B
COMMON_SCRIPT, // 86, SPECIALS
COMMON_SCRIPT, // 87, HALFWIDTH_AND_FULLWIDTH_FORMS
COMMON_SCRIPT, // 88, SPECIALS
};
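
As a rough illustration of the block-based lookup above, here is a minimal sketch that uses the JDK's own `Character.UnicodeBlock` in place of the proposed `getBlock` and `blockToScript`. The class name `BlockScriptSketch`, the map, and the choice of only four blocks are assumptions made for the example; they are not part of the proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: block-based script lookup built on java.lang.Character.UnicodeBlock.
final class BlockScriptSketch {
    // A small subset of the proposed script codes.
    static final byte COMMON_SCRIPT = 0, LATIN_SCRIPT = 1, GREEK_SCRIPT = 2, CYRILLIC_SCRIPT = 3;

    // Stand-in for blockToScript, keyed by JDK block objects instead of byte codes.
    private static final Map<Character.UnicodeBlock, Byte> BLOCK_TO_SCRIPT = new HashMap<>();
    static {
        BLOCK_TO_SCRIPT.put(Character.UnicodeBlock.BASIC_LATIN, LATIN_SCRIPT);
        BLOCK_TO_SCRIPT.put(Character.UnicodeBlock.LATIN_1_SUPPLEMENT, LATIN_SCRIPT);
        BLOCK_TO_SCRIPT.put(Character.UnicodeBlock.GREEK, GREEK_SCRIPT);
        BLOCK_TO_SCRIPT.put(Character.UnicodeBlock.CYRILLIC, CYRILLIC_SCRIPT);
    }

    // Non-letters are COMMON_SCRIPT; letters take the script of their block.
    // Blocks that this sketch does not list also fall back to COMMON_SCRIPT.
    static byte getScript(char c) {
        if (!Character.isLetter(c)) {
            return COMMON_SCRIPT;
        }
        Byte script = BLOCK_TO_SCRIPT.get(Character.UnicodeBlock.of(c));
        return script == null ? COMMON_SCRIPT : script;
    }

    public static void main(String[] args) {
        System.out.println(getScript('a'));       // Latin letter
        System.out.println(getScript('\u03b1'));  // Greek small alpha
        System.out.println(getScript('1'));       // digit, not a letter
    }
}
```

Because the letter test runs first, digits and punctuation inside a script-specific block still come out as COMMON_SCRIPT, which matches the behavior of the Java sample above.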

The blockToScript table depends on the following block constants:

public static final byte // block code
RESERVED_BLOCK = 0,
BASIC_LATIN = 1,
LATIN_1_SUPPLEMENT = 2,
LATIN_EXTENDED_A = 3,
LATIN_EXTENDED_B = 4,
IPA_EXTENSIONS = 5,
SPACING_MODIFIER_LETTERS = 6,
COMBINING_DIACRITICAL_MARKS = 7,
GREEK = 8,
CYRILLIC = 9,
ARMENIAN = 10,
HEBREW = 11,
ARABIC = 12,
SYRIAC = 13,
THAANA = 14,
DEVANAGARI = 15,
BENGALI = 16,
GURMUKHI = 17,
GUJARATI = 18,
ORIYA = 19,
TAMIL = 20,
TELUGU = 21,
KANNADA = 22,
MALAYALAM = 23,
SINHALA = 24,
THAI = 25,
LAO = 26,
TIBETAN = 27,
MYANMAR = 28,
GEORGIAN = 29,
HANGUL_JAMO = 30,
ETHIOPIC = 31,
CHEROKEE = 32,
UNIFIED_CANADIAN_ABORIGINAL_SYLLABICS = 33,
OGHAM = 34,
RUNIC = 35,
KHMER = 36,
MONGOLIAN = 37,
LATIN_EXTENDED_ADDITIONAL = 38,
GREEK_EXTENDED = 39,
GENERAL_PUNCTUATION = 40,
SUPERSCRIPTS_AND_SUBSCRIPTS = 41,
CURRENCY_SYMBOLS = 42,
COMBINING_MARKS_FOR_SYMBOLS = 43,
LETTERLIKE_SYMBOLS = 44,
NUMBER_FORMS = 45,
ARROWS = 46,
MATHEMATICAL_OPERATORS = 47,
MISCELLANEOUS_TECHNICAL = 48,
CONTROL_PICTURES = 49,
OPTICAL_CHARACTER_RECOGNITION = 50,
ENCLOSED_ALPHANUMERICS = 51,
BOX_DRAWING = 52,
BLOCK_ELEMENTS = 53,
GEOMETRIC_SHAPES = 54,
MISCELLANEOUS_SYMBOLS = 55,
DINGBATS = 56,
BRAILLE_PATTERNS = 57,
CJK_RADICALS_SUPPLEMENT = 58,
KANGXI_RADICALS = 59,
IDEOGRAPHIC_DESCRIPTION_CHARACTERS = 60,
CJK_SYMBOLS_AND_PUNCTUATION = 61,
HIRAGANA = 62,
KATAKANA = 63,
BOPOMOFO = 64,
HANGUL_COMPATIBILITY_JAMO = 65,
KANBUN = 66,
BOPOMOFO_EXTENDED = 67,
ENCLOSED_CJK_LETTERS_AND_MONTHS = 68,
CJK_COMPATIBILITY = 69,
CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A = 70,
CJK_UNIFIED_IDEOGRAPHS = 71,
YI_SYLLABLES = 72,
YI_RADICALS = 73,
HANGUL_SYLLABLES = 74,
HIGH_SURROGATES = 75,
HIGH_PRIVATE_USE_SURROGATES = 76,
LOW_SURROGATES = 77,
PRIVATE_USE = 78,
CJK_COMPATIBILITY_IDEOGRAPHS = 79,
ALPHABETIC_PRESENTATION_FORMS = 80,
ARABIC_PRESENTATION_FORMS_A = 81,
COMBINING_HALF_MARKS = 82,
CJK_COMPATIBILITY_FORMS = 83,
SMALL_FORM_VARIANTS = 84,
ARABIC_PRESENTATION_FORMS_B = 85,
SPECIALS = 86,
HALFWIDTH_AND_FULLWIDTH_FORMS = 87;

(If it is useful, I have some code that uses a table of about 700 bytes to map
characters to blocks.)
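
The table mentioned above is not included in this ticket. Purely as an illustration of how a compact character-to-block table could work, the sketch below stores sorted range starts and binary-searches them. The class name `RangeTableSketch`, the two parallel arrays, and the decision to cover only four blocks are assumptions; ranges the sketch does not cover fall back to RESERVED_BLOCK.

```java
import java.util.Arrays;

// Hypothetical sketch: map a char to a block code with a sorted range-start table.
final class RangeTableSketch {
    // Block codes copied from the constants listed in this ticket.
    static final byte RESERVED_BLOCK = 0, BASIC_LATIN = 1, LATIN_1_SUPPLEMENT = 2,
                      GREEK = 8, CYRILLIC = 9;

    // STARTS[i] is the first code unit of range i; the range ends where range i+1 begins.
    private static final char[] STARTS = {0x0000, 0x0080, 0x0100, 0x0370, 0x0400, 0x0500};
    private static final byte[] BLOCKS = {
        BASIC_LATIN, LATIN_1_SUPPLEMENT, RESERVED_BLOCK, GREEK, CYRILLIC, RESERVED_BLOCK
    };

    static byte getBlock(char c) {
        int idx = Arrays.binarySearch(STARTS, c);
        if (idx < 0) {
            // Not an exact range start: binarySearch returns -(insertionPoint) - 1,
            // and the containing range is the one just before the insertion point.
            idx = -idx - 2;
        }
        return BLOCKS[idx];
    }

    public static void main(String[] args) {
        System.out.println(getBlock('A'));       // Basic Latin
        System.out.println(getBlock('\u03b1'));  // Greek
    }
}
```

Since STARTS begins at 0x0000, every char falls at or after the first range start, so the computed index is never negative.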

Part 2. Add to UnicodeSet the ability to check for script matches, using the
(proposed) ISO 15924 codes.
Unfortunately, some of the letter combinations collide with the General
Categories, so some kind of syntax would be needed to distinguish them,
such as
[:Las:] for Latin Script.

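static final String[] scriptToCode = { // declaration line missing in the original; the array name is assumed
"Zz", // COMMON – not a letter: no exact correspondence in 15924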
"La", // LATIN
"El", // GREEK
"Cy", // CYRILLIC
"Hy", // ARMENIAN
"He", // HEBREW
"Ar", // ARABIC
"Sy", // SYRIAC
"Tn", // THAANA
"Dv", // DEVANAGARI
"Bn", // BENGALI
"Gm", // GURMUKHI
"Gu", // GUJARATI
"Or", // ORIYA
"Ta", // TAMIL
"Te", // TELUGU
"Kn", // KANNADA
"Ml", // MALAYALAM
"Si", // SINHALA
"Th", // THAI
"Lo", // LAO
"Bo", // TIBETAN
"My", // MYANMAR
"Kx", // GEORGIAN
"Qa", // JAMO – not separated from Hangul in 15924
"Hg", // HANGUL
"Et", // ETHIOPIC
"Jl", // CHEROKEE
"Sl", // ABORIGINAL
"Og", // OGHAM
"Rn", // RUNIC
"Km", // KHMER
"Mn", // MONGOLIAN
"Hr", // HIRAGANA
"Kk", // KATAKANA
"Bp", // BOPOMOFO
"Ha", // HAN
"Yi", // YI
};
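
To make the collision concrete: "Lo" is both the code for Lao in the list above and the General Category Other Letter, and "Mn" is both Mongolian and Nonspacing Mark. The sketch below shows one way the suggested trailing "s" could disambiguate them. The class `ScriptNameSketch`, the method `scriptForPattern`, and the four-entry map are hypothetical, and this is not how UnicodeSet parses patterns.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: treat [:XXs:] as a script query and anything else as "not a script".
final class ScriptNameSketch {
    // A few of the two-letter codes from the list above.
    private static final Map<String, String> CODE_TO_SCRIPT = new HashMap<>();
    static {
        CODE_TO_SCRIPT.put("La", "LATIN");
        CODE_TO_SCRIPT.put("El", "GREEK");
        CODE_TO_SCRIPT.put("Lo", "LAO");        // collides with General Category Lo
        CODE_TO_SCRIPT.put("Mn", "MONGOLIAN");  // collides with General Category Mn
    }

    // Returns the script name for a pattern such as "[:Las:]", or null when the
    // pattern is not a script query (for example "[:Lo:]", a General Category).
    static String scriptForPattern(String pattern) {
        if (pattern.length() < 4 || !pattern.startsWith("[:") || !pattern.endsWith(":]")) {
            return null;
        }
        String name = pattern.substring(2, pattern.length() - 2);
        if (name.length() != 3 || name.charAt(2) != 's') {
            return null;
        }
        return CODE_TO_SCRIPT.get(name.substring(0, 2));
    }

    public static void main(String[] args) {
        System.out.println(scriptForPattern("[:Las:]"));
        System.out.println(scriptForPattern("[:Lo:]"));
    }
}
```

With a suffix like this, [:Lo:] can keep its General Category meaning while [:Los:] selects the Lao script.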

Status

Assignee: TracBot
Reporter: TracBot
Labels:
Components:
Priority: assess