Change the Thai and Khmer word break iterators to use a word frequency table along with rule-based syllable breaks

Description

CJK break iterators that use a 'word unigram model' are about to be added to ICU trunk. The model relies on a word frequency table.

A similar approach could be taken for Thai and Khmer, whose break iterators currently use the longest-match algorithm.
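To illustrate the difference between the two approaches, here is a minimal Python sketch, not ICU code. It uses a toy Latin-script string so the result is easy to check by eye. The frequency table, the counts in it, and the `UNKNOWN_COST` penalty are all made up for illustration. The unigram segmenter picks the segmentation with the lowest total negative log probability via dynamic programming, while the baseline greedily takes the longest dictionary word at each position.

```python
import math

# Hypothetical toy frequency table; the counts are invented for illustration.
FREQ = {"the": 1000, "table": 200, "theta": 5, "ble": 1}
TOTAL = sum(FREQ.values())
MAX_WORD_LEN = max(len(w) for w in FREQ)
UNKNOWN_COST = 20.0  # assumed penalty for one out-of-vocabulary character


def cost(word):
    """Negative log probability of a dictionary word under the unigram model."""
    return -math.log(FREQ[word] / TOTAL)


def longest_match(text):
    """Greedy baseline: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_WORD_LEN), i, -1):
            if text[i:j] in FREQ:
                break
        else:
            j = i + 1  # no dictionary word starts here; emit one character
        words.append(text[i:j])
        i = j
    return words


def unigram_segment(text):
    """Return the segmentation with the lowest total cost (dynamic programming)."""
    n = len(text)
    best = [0.0] + [math.inf] * n  # best[k]: lowest cost of segmenting text[:k]
    back = [0] * (n + 1)           # back[k]: start index of the last word in text[:k]
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            piece = text[start:end]
            if piece in FREQ:
                c = cost(piece)
            elif end - start == 1:
                c = UNKNOWN_COST  # fall back to a single unknown character
            else:
                continue
            if best[start] + c < best[end]:
                best[end] = best[start] + c
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


print(longest_match("thetable"))    # ['theta', 'ble']
print(unigram_segment("thetable"))  # ['the', 'table']
```

With these toy counts, longest match commits to the rare word "theta" and is left with "ble", whereas the unigram model prefers the two frequent words.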

In addition, we may consider pre-processing the input to break it into syllables (using a regex or some other method) before applying the word unigram model (WUM).
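The pre-processing idea can be sketched as follows, again in Python and again with toy data rather than real Thai or Khmer rules. The `SYLLABLE` regex is a hypothetical stand-in (consonants followed by vowels in Latin script), and the frequency table, `UNKNOWN_COST`, and `MAX_SYLLABLES` are assumptions. The point is only that word boundaries are restricted to syllable boundaries, so the unigram search runs over syllables instead of individual characters.

```python
import math
import re

# Toy stand-in for a real syllable rule: optional consonants followed by
# vowels, plus a trailing consonant run. Not a Thai or Khmer rule.
SYLLABLE = re.compile(r"[^aeiou]*[aeiou]+|[^aeiou]+$")

# Hypothetical frequency table; the counts are invented for illustration.
FREQ = {"banana": 50, "mama": 30, "ma": 10, "bana": 3, "nama": 2}
TOTAL = sum(FREQ.values())
UNKNOWN_COST = 20.0  # assumed penalty for a syllable that is not a known word
MAX_SYLLABLES = 4    # assumed upper bound on syllables per word


def split_syllables(text):
    """Break the input into syllables using the toy regex."""
    return SYLLABLE.findall(text)


def segment(text):
    """Unigram segmentation where words may only start and end at syllable breaks."""
    syl = split_syllables(text)
    n = len(syl)
    best = [0.0] + [math.inf] * n  # best[k]: lowest cost for the first k syllables
    back = [0] * (n + 1)           # back[k]: syllable index where the last word starts
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_SYLLABLES), end):
            word = "".join(syl[start:end])
            if word in FREQ:
                c = -math.log(FREQ[word] / TOTAL)
            elif end - start == 1:
                c = UNKNOWN_COST  # unknown single syllable stands alone
            else:
                continue
            if best[start] + c < best[end]:
                best[end] = best[start] + c
                back[end] = start
    # Walk the back-pointers to recover the words in order.
    words, end = [], n
    while end > 0:
        words.append("".join(syl[back[end]:end]))
        end = back[end]
    return words[::-1]


print(split_syllables("bananamama"))  # ['ba', 'na', 'na', 'ma', 'ma']
print(segment("bananamama"))          # ['banana', 'mama']
```

Working on syllables shrinks the search space and rules out breaks in the middle of a syllable, at the cost of depending on the accuracy of the syllable rules.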

Environment

Status

Assignee

googler@icu-project.org

Reporter

Jungshik Shin (신정식)

Labels

Time Needed

Weeks

tracCc

andy,grhoten,mark,markus,pedberg

tracCreated

Jul 25, 2012, 5:46 PM

tracOwner

googler

tracProject

all

tracReporter

jungshik

tracResolution

duplicate

tracStatus

closed

tracWeeks

2

Components

Priority

minor