PyCantonese: Cantonese Linguistics and NLP in Python

PyCantonese is a Python library for Cantonese linguistics and natural language processing (NLP). Currently implemented features:

Accessing and searching corpus data
Parsing and conversion tools for Jyutping romanization
Parsing Cantonese text
Stop words
Word segmentation
Part-of-speech tagging

The design of PyCantonese prioritizes ease of use and linguistic knowledge. It has been successfully used by both academic and commercial organizations, including major US tech companies.

Since v4.0.0 (March 2026), PyCantonese depends on Rustling, a library for efficient CHAT data handling, word segmentation, and part-of-speech tagging.

Download and Install

PyCantonese is available through Python and JavaScript. The Python package can be installed with pip, uv, or conda:

pip install pycantonese

uv add pycantonese

conda install -c conda-forge pycantonese
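
After running one of the install commands above, one way to confirm that the package is visible to the current interpreter is to query its installed version. The sketch below uses only the standard library's importlib.metadata. The helper name installed_version is hypothetical and is not part of PyCantonese.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "pycantonese") -> Optional[str]:
    """Return the installed version string of `package`, or None if absent.

    Hypothetical helper for illustration; not provided by PyCantonese.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("pycantonese is not installed in this environment")
    else:
        print(f"pycantonese {found} is installed")
```

Because the lookup reads package metadata and does not import the library, it works the same way whether the package was installed with pip, uv, or conda.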

Links

Author: Jackson L. Lee
Source code: https://github.com/jacksonllee/pycantonese
Social media: Facebook

How to Cite

Lee, Jackson L., Litong Chen, Charles Lam, Chaak Ming Lau, and Tsz-Him Tsui. 2022. PyCantonese: Cantonese Linguistics and NLP in Python. Proceedings of the 13th Language Resources and Evaluation Conference.

@inproceedings{lee-etal-2022-pycantonese,
   title = "PyCantonese: Cantonese Linguistics and NLP in Python",
   author = "Lee, Jackson L.  and
      Chen, Litong  and
      Lam, Charles  and
      Lau, Chaak Ming  and
      Tsui, Tsz-Him",
   booktitle = "Proceedings of The 13th Language Resources and Evaluation Conference",
   month = jun,
   year = "2022",
   publisher = "European Language Resources Association",
}

License

MIT License.

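Please note that PyCantonese includes data from the following sources, each distributed under its own license (shown in parentheses). Not all of these licenses are permissive: GPL-3.0, for example, is a copyleft license.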

Hong Kong Cantonese Corpus (CC BY)
CantoMap (GPL-3.0)
rime-cantonese (CC BY 4.0)
Common Voice Cantonese (Mozilla Public License 2.0)
Cantonese-Traditional Chinese Parallel Corpus (CC0 1.0 Universal)

For details about these datasets, please see their documentation.

Logo

The PyCantonese logo is the Chinese character 粵 meaning Cantonese, with artistic design by albino.snowman (Instagram handle).