PyCantonese: Cantonese Linguistics and NLP in Python
PyCantonese is a Python library for Cantonese linguistics and natural language processing (NLP). Currently implemented features:
Accessing and searching corpus data
Parsing and conversion tools for Jyutping romanization
Parsing Cantonese text
Stop words
Word segmentation
Part-of-speech tagging
The design of PyCantonese prioritizes ease of use and linguistic knowledge. It has been successfully used by both academic and commercial organizations, including major US tech companies.
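To make the Jyutping parsing feature listed above more concrete, here is a minimal, self-contained sketch of how a Jyutping string can be split into syllables and each syllable decomposed into onset, nucleus, coda, and tone. It is not PyCantonese's implementation or API: the names `Syllable` and `parse_syllables` are hypothetical, and the regular expression assumes a simplified Jyutping inventory with no further phonotactic checks.

```python
import re
from typing import List, NamedTuple


class Syllable(NamedTuple):
    """One Jyutping syllable (hypothetical type, for illustration only)."""

    onset: str
    nucleus: str
    coda: str
    tone: str


# Simplified Jyutping inventory (an assumption of this sketch).
# Longer alternatives come first so that "gw" is tried before "g".
_ONSETS = "gw|kw|ng|b|p|m|f|d|t|n|l|g|k|h|w|z|c|s|j"
_NUCLEI = "aa|oe|eo|yu|a|e|i|o|u|ng|m"  # "ng" and "m" cover syllabic nasals
_CODAS = "ng|p|t|k|m|n|i|u"

_SYLLABLE = re.compile(
    rf"(?P<onset>{_ONSETS})?(?P<nucleus>{_NUCLEI})(?P<coda>{_CODAS})?(?P<tone>[1-6])"
)


def parse_syllables(jyutping: str) -> List[Syllable]:
    """Split a Jyutping string at tone digits and decompose each syllable."""
    # Each syllable is a run of lowercase letters followed by one tone digit.
    chunks = re.findall(r"[a-z]+[1-6]", jyutping)
    if "".join(chunks) != jyutping:
        raise ValueError(f"not a sequence of Jyutping syllables: {jyutping!r}")

    syllables = []
    for chunk in chunks:
        match = _SYLLABLE.fullmatch(chunk)
        if match is None:
            raise ValueError(f"cannot parse syllable: {chunk!r}")
        syllables.append(
            Syllable(
                onset=match["onset"] or "",
                nucleus=match["nucleus"],
                coda=match["coda"] or "",
                tone=match["tone"],
            )
        )
    return syllables


print(parse_syllables("gwong2dung1waa2"))
```

For 廣東話 (gwong2dung1waa2), the sketch yields three syllables: gw + o + ng with tone 2, d + u + ng with tone 1, and w + aa with tone 2 and an empty coda. For real work, use the library's own Jyutping tools, which are described in its documentation.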
Download and Install
To download and install the most recent stable version:
# Through pip
pip install --upgrade pycantonese
# Through conda-forge
conda install -c conda-forge pycantonese
For Pyodide users, PyCantonese now ships a WASM wheel, attached to a GitHub release (find the .whl with emscripten in the name).
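After installing through pip or conda-forge, one way to confirm that the package is visible to the current Python environment is to query its installed metadata. This is a small sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of PyCantonese.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("pycantonese")
if found is None:
    print("pycantonese is not installed in this environment")
else:
    print(f"pycantonese {found} is installed")
```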
Ready for more? Check out the Quickstart page.
Links
Author: Jackson L. Lee
Source code: https://github.com/jacksonllee/pycantonese
Social media: Facebook
How to Cite
Lee, Jackson L., Litong Chen, Charles Lam, Chaak Ming Lau, and Tsz-Him Tsui. 2022. PyCantonese: Cantonese Linguistics and NLP in Python. Proceedings of the 13th Language Resources and Evaluation Conference.
@inproceedings{lee-etal-2022-pycantonese,
title = "PyCantonese: Cantonese Linguistics and NLP in Python",
author = "Lee, Jackson L. and
Chen, Litong and
Lam, Charles and
Lau, Chaak Ming and
Tsui, Tsz-Him",
booktitle = "Proceedings of The 13th Language Resources and Evaluation Conference",
month = jun,
year = "2022",
publisher = "European Language Resources Association",
}
License
MIT License. Please see LICENSE.txt in the GitHub source code for details.
PyCantonese includes data from the following sources (please see src/pycantonese/data for details):
Hong Kong Cantonese Corpus (CC BY)
rime-cantonese (CC BY 4.0)
Common Voice Cantonese (Mozilla Public License 2.0)
Cantonese-Traditional Chinese Parallel Corpus (CC0 1.0 Universal)
Logo
The PyCantonese logo is the Chinese character 粵, meaning "Cantonese", with artistic design by albino.snowman (Instagram handle).