Korean word break problem

Description

[Eclipse bug 444897|https://bugs.eclipse.org/bugs/show_bug.cgi?id=444897]

A run of contiguous Hangul characters is treated as a single word by the ICU (and Java) word break iterator. For the Korean text "한글 문장입니다", the word boundaries are as follows: "한글| |문장입니다|".

ICU4J 52 handles the text "한글 문장입니다" correctly. However, that version has a problem when the input text starts with Latin characters. For example, with the input text "abc 한글 문장입니다":


abc| |'''한|글'''| |문장입니다| (ICU4J 52)

abc| |'''한글'''| |문장입니다| (Other ICU4J versions)
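The boundary strings above can be reproduced with a small harness like the following. This is only a sketch: it uses the JDK's `java.text.BreakIterator` so that it runs without extra dependencies, and the class name `WordBoundaryDemo` and helper `markBoundaries` are made up for illustration. To check a particular ICU4J release, the import would be swapped for `com.ibm.icu.text.BreakIterator`, which offers the same `getWordInstance`, `setText`, `first`, `next`, and `DONE` members. The output depends on the library version in use, so none is listed here.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper class for illustration; not part of ICU4J or Eclipse.
class WordBoundaryDemo {

    // Returns the text with '|' inserted at every boundary the word
    // break iterator reports, except the leading boundary at offset 0.
    static String markBoundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.KOREAN);
        it.setText(text);
        StringBuilder out = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.append(text, start, end).append('|');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hangul-only input, reported as working in ICU4J 52.
        System.out.println(markBoundaries("한글 문장입니다"));
        // Input starting with Latin characters, which triggers the problem.
        System.out.println(markBoundaries("abc 한글 문장입니다"));
    }
}
```

Running the same harness against ICU4J 52 and another ICU4J version should make the extra boundary inside "한글" visible in the second line of output.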


This problem was caused by the introduction of dictionary-based word breaking for East Asian scripts. The bug in ICU4J 52 was fixed in a later version of ICU4J.

The Eclipse team wants a quick fix for this problem based on ICU4J 52.

Status

Assignee

Yoshito Umaoka

Reporter

Yoshito Umaoka

Labels

Components

Fix versions

Priority

medium