pycantonese.segment
- pycantonese.segment(unsegmented: str, *, offsets: Literal[False] = False) → list[str]
- pycantonese.segment(unsegmented: str, *, offsets: Literal[True]) → list[tuple[str, tuple[int, int]]]
Segment the unsegmented input.
The word segmentation model is a Jieba-style DAG+HMM hybrid segmenter, trained on HKCanCor, rime-cantonese, Common Voice Cantonese, and the Cantonese-Traditional Chinese Parallel Corpus.
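- Parameters:
unsegmented (str): The unsegmented input text.
offsets (bool, keyword-only, default False): If True, return each segment together with its (start, end) offsets in the input.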
- Returns:
list[str] or list[tuple[str, tuple[int, int]]]
Examples
>>> segment("廣東話容唔容易學?")  # "Is Cantonese easy to learn?"
['廣東話', '容', '唔', '容易', '學', '?']
>>> segment("廣東話容唔容易學?", offsets=True)
[('廣東話', (0, 3)), ('容', (3, 4)), ('唔', (4, 5)), ('容易', (5, 7)), ('學', (7, 8)), ('?', (8, 9))]
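The offsets in the example above look like half-open character indices into the input string, so that `text[start:end]` recovers each segment. The page does not state this explicitly, so the following is a minimal sketch under that assumption. It does not call pycantonese: it works on a literal copy of the documented output, and the helper names `spans_match` and `rejoin` are hypothetical, not part of the library.

```python
def spans_match(text, segments):
    """Return True if text[start:end] == word for every (word, (start, end)) pair.

    Assumes half-open character offsets, as the documented example suggests.
    """
    return all(text[start:end] == word for word, (start, end) in segments)


def rejoin(segments):
    """Concatenate the segment strings, ignoring the offsets."""
    return "".join(word for word, _ in segments)


text = "廣東話容唔容易學?"

# Literal copy of the documented result of segment(text, offsets=True).
documented = [
    ("廣東話", (0, 3)),
    ("容", (3, 4)),
    ("唔", (4, 5)),
    ("容易", (5, 7)),
    ("學", (7, 8)),
    ("?", (8, 9)),
]

print(spans_match(text, documented))  # each span slices back to its word
print(rejoin(documented) == text)     # segments cover the input with no gaps
```

A check like this can be useful when aligning segments with other annotations of the same string, since it catches any mismatch between the offsets and the text they are meant to index.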