Change Thai and Khmer word break iterators to use a word frequency table along with rule-based syllable breaks

Description

CJK break iterators that use a 'word unigram model' (WUM) are about to be added to ICU trunk. The model relies on a word frequency table.
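For reference, a word unigram model can be sketched as a dynamic program that picks the segmentation whose words have the highest combined probability. The sketch below is only an illustration, not the ICU code: the `FREQ` table, its counts, the `segment` function and the unknown-character penalty are all hypothetical, and Latin text stands in for CJK or Thai.

```python
import math

# Hypothetical toy frequency table; the counts are invented for illustration.
FREQ = {"the": 500, "them": 50, "me": 120, "men": 60,
        "end": 80, "mend": 40, "he": 200}


def segment(text, freq):
    """Split text into the word sequence with the highest unigram score."""
    total = sum(freq.values())
    max_len = max(map(len, freq))
    # Assumed penalty for a single character that is not in the table.
    unknown = math.log(0.5 / total)

    # best[i] is the best log-probability of any segmentation of text[:i];
    # back[i] is the start index of the last word in that segmentation.
    best = [0.0] + [-math.inf] * len(text)
    back = [0] * (len(text) + 1)

    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq:
                score = math.log(freq[word] / total)
            elif end - start == 1:
                score = unknown
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start

    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = len(text)
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("themend", FREQ))  # ['the', 'mend']
```

With these made-up counts, "the" + "mend" scores higher than "them" + "end" because the product of the word frequencies is larger, even though "them" is the longer first word.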

A similar approach can be taken for Thai and Khmer, whose break iterators currently use a longest-match algorithm.
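As a point of comparison, a plain greedy longest-match segmenter might look like the sketch below. It is a simplified illustration and not the actual ICU dictionary engine, which may be more elaborate; the `WORDS` set and the `longest_match` function are hypothetical, and Latin text stands in for Thai or Khmer.

```python
# Hypothetical toy dictionary, invented for illustration.
WORDS = {"the", "them", "me", "men", "end", "mend", "he"}


def longest_match(text, words):
    """Greedily take the longest dictionary word at each position."""
    max_len = max(map(len, words))
    out = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            # A single character is always accepted so the loop makes progress.
            if candidate in words or length == 1:
                out.append(candidate)
                pos += length
                break
    return out


print(longest_match("themend", WORDS))  # ['them', 'end']
```

The greedy choice commits to "them" without considering how common each word is, which is the kind of decision a frequency-based model could make differently.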

In addition, we may consider pre-processing the input to break it into syllables (using a regex or some other rule-based method) before applying the word unigram model (WUM).
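A rough idea of what regex-based pre-processing could look like for Thai is sketched below. It is far cruder than real syllable rules: it only glues a leading vowel (U+0E40 to U+0E44) to the following consonant and glues following vowels and marks (U+0E30 to U+0E3A, U+0E47 to U+0E4E) to the preceding consonant, so it yields unbreakable clusters rather than true syllables. The `CLUSTER` pattern and the `clusters` function are hypothetical names, not part of ICU.

```python
import re

# Assumed, simplified cluster pattern:
#   optional leading vowel, one consonant, then any following vowels or marks.
# Any other character falls through to the final "." and becomes its own unit.
CLUSTER = re.compile(
    r"[\u0E40-\u0E44]?[\u0E01-\u0E2E][\u0E30-\u0E3A\u0E47-\u0E4E]*|.",
    re.DOTALL,
)


def clusters(text):
    """Split text into units that a word break should never fall inside."""
    return CLUSTER.findall(text)


# "ภาษาไทย" (Thai language) splits into four clusters.
print(clusters("ภาษาไทย"))  # ['ภา', 'ษา', 'ไท', 'ย']
```

A frequency-based segmenter could then be restricted to consider word boundaries only at cluster edges, which shrinks the search space and rules out breaks in the middle of a cluster.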

Status

Assignee

googler@icu-project.org

Reporter

Jungshik Shin

Labels

Time Needed

Weeks

Components

Priority

minor