Word Segmentation
By convention, Cantonese is not written with word boundaries (like spaces in English).
However, in many natural language processing tasks, it is often necessary to
work with a segmented form of Cantonese data.
PyCantonese provides the function segment() that takes an
unsegmented text string in Cantonese characters and returns
the segmented version:
import pycantonese
pycantonese.segment("廣東話容唔容易學?") # Is Cantonese easy to learn?
# ['廣東話', '容', '唔', '容易', '學', '?']
The word segmentation is powered by a Jieba-styled, semi-supervised hybrid approach that combines a directed acyclic graph and a hidden Markov model. The segmenter is trained by HKCanCor, rime-cantonese, Common Voice Cantonese, and Cantonese-Traditional Chinese Parallel Corpus.