pycantonese.segment

pycantonese.segment(unsegmented: str) → list[str][source]

Segment the unsegmented input.

The word segmentation model is a Jieba-styled DAG+HMM hybrid segmenter, trained by HKCanCor, rime-cantonese, Common Voice Cantonese, and Cantonese-Traditional Chinese Parallel Corpus.

Parameters:

unsegmentedstr: Unsegmented input.

Returns:

list[str]

Examples

>>> segment("廣東話容唔容易學？")  # "Is Cantonese easy to learn?"
['廣東話', '容', '唔', '容', '易', '學', '？']