CJK break iterators that use a 'word unigram model' are about to be added to ICU trunk. The model relies on a word frequency table.
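To make the idea concrete, here is a minimal sketch of unigram segmentation by dynamic programming. It is not the ICU implementation: the function name `segment_unigram`, the `max_word_len` and `unknown_penalty` parameters, and the toy Latin-script frequency table are all assumptions chosen for illustration. Each candidate word costs the negative log of its relative frequency, and the segmentation with the lowest total cost wins.

```python
import math

def segment_unigram(text, freq, max_word_len=8, unknown_penalty=20.0):
    """Return the word sequence with the lowest total unigram cost.

    freq maps word -> count (hypothetical table). Single characters that
    are missing from the table get a fixed penalty so that every position
    stays reachable.
    """
    total = sum(freq.values())
    n = len(text)
    best = [0.0] + [math.inf] * n   # best[i]: lowest cost of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in freq:
                cost = -math.log(freq[word] / total)
            elif end - start == 1:
                cost = unknown_penalty
            else:
                continue            # unknown multi-character span: skip
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

# Toy frequency table (made up for this example).
toy_freq = {"the": 100, "table": 20, "tab": 5, "he": 10}
print(segment_unigram("thetable", toy_freq))  # ['the', 'table']
```

The quality of the result depends entirely on the frequency table, which is why the table is the central resource for this approach.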
A similar approach can be taken for Thai and Khmer. Currently, these languages are handled with the longest-match algorithm.
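For comparison, a generic greedy forward longest-match segmenter can be sketched as follows. This is a textbook version, not ICU's actual Thai or Khmer code; the name `segment_longest_match`, the `max_word_len` parameter, and the toy dictionary are assumptions for illustration.

```python
def segment_longest_match(text, dictionary, max_word_len=8):
    """Greedy forward segmentation: at each position take the longest
    dictionary word; fall back to a single character if none matches."""
    words = []
    i = 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        for length in range(longest, 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy dictionary (made up for this example).
toy_dict = {"the", "them", "men", "table"}
print(segment_longest_match("themen", toy_dict))  # ['them', 'e', 'n']
```

The example shows the typical weakness of a purely greedy match: taking "them" first leaves "en" unsegmentable, whereas a model that scores whole segmentations could prefer "the" + "men".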
In addition, we may consider pre-processing the input to break it into syllables (using regular expressions or other means) before applying the word unigram model.
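A rough sketch of such a pre-processing step for Khmer, using a regular expression, might look like the following. The pattern is a deliberate simplification and an assumption, not a vetted syllable grammar: it groups a base letter with any coeng (U+17D2) subscript consonants and trailing dependent vowels or signs, so it yields orthographic clusters as a stand-in for syllables. The names `KHMER_CLUSTER` and `split_clusters` are hypothetical.

```python
import re

# Simplified Khmer cluster pattern (assumption, not a complete grammar):
#   base consonant or independent vowel,
#   then zero or more (coeng + subscript consonant),
#   then zero or more dependent vowels / signs.
# Anything else is emitted one code point at a time.
KHMER_CLUSTER = re.compile(
    "[\u1780-\u17B3]"
    "(?:\u17D2[\u1780-\u17A2])*"
    "[\u17B6-\u17D1\u17DD]*"
    "|.",
    re.DOTALL,
)

def split_clusters(text):
    """Split text into cluster units that a word model could treat as atoms."""
    return KHMER_CLUSTER.findall(text)

# U+1781 U+17D2 U+1798 U+17C2 U+179A (the word "Khmer" in Khmer script)
units = split_clusters("\u1781\u17d2\u1798\u17c2\u179a")
print(len(units))  # 2: the first four code points form one cluster
```

Word candidates would then be built from whole units rather than individual code points, which keeps a break from ever falling inside a cluster and reduces the number of candidate boundaries to score.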