pycantonese.segment

pycantonese.segment(unsegmented: str, *, offsets: Literal[False] = False) → list[str]

pycantonese.segment(unsegmented: str, *, offsets: Literal[True]) → list[tuple[str, tuple[int, int]]]

Segment an unsegmented Cantonese string into words.

The word segmentation model is a Jieba-style DAG+HMM hybrid segmenter, trained on HKCanCor, rime-cantonese, Common Voice Cantonese, and the Cantonese-Traditional Chinese Parallel Corpus.

Parameters:

unsegmented (str) – Unsegmented input.
offsets (bool, optional) – If True, return each word as a (word, (start, end)) tuple where start and end are character offsets into the original unsegmented string (exclusive end, like Python slices). Defaults to False.

Returns:

list[str] if offsets is False; list[tuple[str, tuple[int, int]]] if offsets is True.

Examples

>>> segment("廣東話容唔容易學？")  # "Is Cantonese easy to learn?"
['廣東話', '容', '唔', '容易', '學', '？']
>>> segment("廣東話容唔容易學？", offsets=True)
[('廣東話', (0, 3)), ('容', (3, 4)), ('唔', (4, 5)),
 ('容易', (5, 7)), ('學', (7, 8)), ('？', (8, 9))]
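
Because the end offset is exclusive, each (start, end) pair can be used directly as a slice into the original string. The following is a minimal sketch of that property, not part of the library: the helper name offsets_are_consistent is hypothetical, and the result list is copied from the documented output above rather than produced by calling segment, so the snippet runs without pycantonese installed.

```python
def offsets_are_consistent(unsegmented, words_with_offsets):
    """Return True if every (word, (start, end)) pair slices back to its word.

    Hypothetical helper for illustration only; not part of pycantonese.
    """
    return all(
        unsegmented[start:end] == word
        for word, (start, end) in words_with_offsets
    )


text = "廣東話容唔容易學？"

# Copied from the documented output of segment(text, offsets=True).
# In real use, this list would come from pycantonese.segment(text, offsets=True).
result = [
    ("廣東話", (0, 3)), ("容", (3, 4)), ("唔", (4, 5)),
    ("容易", (5, 7)), ("學", (7, 8)), ("？", (8, 9)),
]

# Each slice text[start:end] should reproduce the corresponding word.
consistent = offsets_are_consistent(text, result)

# Dropping the offsets gives the same shape as the default (offsets=False) output.
words = [word for word, _ in result]
```

In this example the offsets also tile the input without gaps, so joining the words reproduces the original string. Whether that holds for every input (for instance, input containing whitespace) is not stated here and should be checked rather than assumed.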