Korean word break problem

Description

[Eclipse bug 444897|https://bugs.eclipse.org/bugs/show_bug.cgi?id=444897]

A contiguous run of Hangul characters is treated as a single word by the ICU (and Java) word break iterator. For the Korean text "한글 문장입니다", the word boundaries are as follows: "한글| |문장입니다|".
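The boundary notation above can be reproduced with a small sketch like the following. It uses the JDK's `java.text.BreakIterator` so that it runs without extra dependencies; ICU4J's `com.ibm.icu.text.BreakIterator` exposes the same `getWordInstance`/`setText`/`first`/`next` calls, so swapping the import should be enough to test a specific ICU4J version. The class and method names (`WordBoundaryDemo`, `markBoundaries`) are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not part of ICU4J or the JDK.
class WordBoundaryDemo {
    /** Returns the text with '|' appended after every segment the word iterator reports. */
    static String markBoundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        StringBuilder out = new StringBuilder();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) pair is one segment.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.append(text, start, end).append('|');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Per this report, a correct iterator keeps each Hangul run in one segment.
        System.out.println(markBoundaries("한글 문장입니다", Locale.KOREAN));
        System.out.println(markBoundaries("abc 한글 문장입니다", Locale.KOREAN));
    }
}
```

The exact segments depend on the iterator implementation and version in use, which is the point of this ticket.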

ICU4J 52 handles the text "한글 문장입니다" correctly. However, that version has a problem when the input text starts with Latin characters. For example, given the input text "abc 한글 문장입니다":


abc| |'''한|글'''| |문장입니다| (ICU4J 52)

abc| |'''한글'''| |문장입니다| (Other ICU4J versions)
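A regression check for this behavior could look roughly like the sketch below: it asks the word iterator whether any boundary falls strictly inside the "한글" run. This is only an illustration, written against the JDK's `java.text.BreakIterator`; to exercise ICU4J, the import would be changed to `com.ibm.icu.text.BreakIterator`, which also provides `isBoundary(int)`. The names `HangulSplitCheck` and `hasBoundaryInside` are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical check, not taken from the ICU4J test suite.
class HangulSplitCheck {
    /** Returns true if the word iterator reports a boundary strictly between start and end. */
    static boolean hasBoundaryInside(String text, int start, int end, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        for (int offset = start + 1; offset < end; offset++) {
            if (it.isBoundary(offset)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "abc 한글 문장입니다";
        int start = text.indexOf("한글");
        int end = start + "한글".length();
        // "split" corresponds to the ICU4J 52 behavior described in this report.
        System.out.println(hasBoundaryInside(text, start, end, Locale.KOREAN) ? "split" : "intact");
    }
}
```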


This problem was caused by the introduction of dictionary-based word breaking for East Asian scripts. The bug in ICU4J 52 was fixed in a later version of ICU4J.

The Eclipse team wants a quick fix for this problem, based on ICU4J 52.

Environment

Status

Assignee

Yoshito Umaoka

Reporter

Yoshito Umaoka

Labels

tracCreated

Jan 06, 2015, 5:18 PM

tracOwner

yoshito

tracProject

ICU4J

tracReporter

yoshito

tracResolution

fixed

tracReviewer

scott_russell

tracStatus

closed

Components

Fix versions

Priority

medium