Corpus Data
CHAT Format
For a corpus to be useful for modeling work beyond search queries, its source data has to be available in a machine-readable format. For Cantonese, several corpora that meet this criterion come from CHILDES and TalkBank, thanks to research on Cantonese language acquisition in recent years. More generally, because Cantonese is primarily a spoken language, many of its corpora are transcriptions of naturalistic speech. For these reasons, PyCantonese adopts the CHAT corpus format from CHILDES and TalkBank. CHAT is widely used, well documented, and rich in linguistic annotations. PyCantonese uses the library Rustling to parse CHAT data files. For a primer on the CHAT data format, please see here.
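To make the format concrete, here is a minimal sketch of how a CHAT transcript is laid out: header lines start with @, each utterance (the main tier) starts with * and a speaker code, and dependent tiers that carry annotations start with %. The sample transcript and the helper split_tiers() are hypothetical illustrations, not part of PyCantonese or Rustling, and they ignore many details of real CHAT data such as continuation lines and time marks.

```python
# A tiny, made-up CHAT transcript (tabs separate the tier label from its content).
SAMPLE_CHAT = """\
@UTF8
@Begin
@Languages:\tyue
@Participants:\tXXA Adult
*XXA:\t我 食 飯 .
%mor:\tr|ngo5 v|sik6 n|faan6 .
@End
"""


def split_tiers(chat_text):
    """Group the lines of a CHAT string into headers and utterances.

    Hypothetical helper for illustration only; real parsing should be left
    to PyCantonese / Rustling.
    """
    headers, utterances = [], []
    for line in chat_text.splitlines():
        if line.startswith("@"):
            # Header line, e.g. @Languages or @Participants.
            headers.append(line)
        elif line.startswith("*"):
            # Main tier: speaker code, then the word-segmented utterance.
            speaker, _, text = line[1:].partition(":\t")
            utterances.append(
                {"speaker": speaker, "words": text.split(), "tiers": {}}
            )
        elif line.startswith("%") and utterances:
            # Dependent tier: annotations attached to the preceding utterance.
            name, _, text = line[1:].partition(":\t")
            utterances[-1]["tiers"][name] = text.split()
    return headers, utterances


headers, utterances = split_tiers(SAMPLE_CHAT)
```

Because the main tier is already split on whitespace, each word lines up by position with its annotation on the dependent tier, which is why CHAT suits word-segmented data.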
Built-in Data
PyCantonese ships with two built-in corpora: HKCanCor and CantoMap.
HKCanCor
The Hong Kong Cantonese Corpus
(HKCanCor; license: CC BY) is available via the function hkcancor():
import pycantonese
hkcancor = pycantonese.hkcancor()
hkcancor.n_files # number of data files
# 58
len(hkcancor.words()) # number of words as segmented from all the utterances
# 153656
HKCanCor is word-segmented and annotated for both Jyutping romanization and part-of-speech tags.
The original HKCanCor source files are in an XML format. They have been converted to CHAT for incorporation into PyCantonese. On the format conversion, please consult this readme.
CantoMap
The CantoMap corpus
(license: GPL-3.0) is a collection of contemporary Hong Kong Cantonese
conversation recordings from MapTask exercises.
It is available via the function cantomap():
import pycantonese
cantomap = pycantonese.cantomap()
cantomap.n_files # number of data files
# 99
len(cantomap.words()) # number of words as segmented from all the utterances
# 118572
CantoMap is word-segmented and annotated for Jyutping romanization. Part-of-speech tags (HKCanCor tagset) are added by PyCantonese’s POS tagger during the conversion from ELAN to CHAT format.
The original CantoMap source files are in the ELAN annotation format (.eaf).
Because the CHAT format inherently requires word-segmented data,
it is a natural fit for the word-segmented, Jyutping-annotated CantoMap data.
The ELAN files have been converted to CHAT for incorporation into PyCantonese.
On the format conversion, please consult this
readme.
CHILDES and TalkBank Data
For corpora beyond the built-in ones, PyCantonese provides the function read_chat()
to read in Cantonese data in the CHAT format.
As of 2026, CHAT datasets are publicly available from TalkBank. If you visit the webpage of a specific dataset, you’ll have to log in (account setup is free) before you can download the full transcripts as a ZIP archive to your local drive.
Note
All publicly available TalkBank datasets are associated with the CC BY-NC-SA 3.0 license.
Here are the Cantonese-related TalkBank datasets (in alphabetical order):
Custom Data
If you have your own CHAT data,
read_chat() accepts a local ZIP archive,
a local directory, or a single .cha file path.
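As a rough sketch of the three kinds of input, the snippet below writes a made-up transcript to disk as a single .cha file, inside a directory, and inside a ZIP archive. The file names, the sample transcript, and the helpers write_sample_inputs() and load() are hypothetical; only read_chat() comes from PyCantonese, and it is imported lazily so the file-preparation part runs on its own.

```python
import zipfile
from pathlib import Path

# Made-up transcript for illustration; real data will have more headers.
SAMPLE = (
    "@UTF8\n@Begin\n@Languages:\tyue\n@Participants:\tXXA Adult\n"
    "*XXA:\t我 食 飯 .\n@End\n"
)


def write_sample_inputs(root):
    """Create a .cha file, its parent directory, and a ZIP archive under root."""
    root = Path(root)
    data_dir = root / "my_corpus"
    data_dir.mkdir(parents=True, exist_ok=True)

    cha_path = data_dir / "sample.cha"
    cha_path.write_text(SAMPLE, encoding="utf-8")

    zip_path = root / "my_corpus.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.write(cha_path, arcname="my_corpus/sample.cha")

    return {"file": cha_path, "dir": data_dir, "zip": zip_path}


def load(path):
    """Pass a .cha file, a directory, or a ZIP archive path to read_chat()."""
    import pycantonese  # imported here so the helper above has no dependency

    return pycantonese.read_chat(str(path))
```

With PyCantonese installed, load(paths["zip"]), load(paths["dir"]), and load(paths["file"]) would each hand one of the three input kinds to read_chat().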
For more control over how data is read, the CHAT class
provides the following class methods:
from_zip() – local ZIP archive
from_dir() – local directory
from_files() – one or more local file paths
from_strs() – in-memory strings
from_git() – Git repository (cloned and cached)
from_url() – URL to a ZIP archive (downloaded and cached)
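One way to read the list above is as a mapping from the kind of source you have to the class method that handles it. The helper below, pick_constructor(), is a hypothetical sketch of that mapping: it only returns a method name and does not call PyCantonese. Its rules (for example, treating a non-ZIP URL as a Git repository) are simplifying assumptions, not behavior of the library.

```python
from pathlib import Path


def pick_constructor(source):
    """Return the name of the CHAT class method that matches `source`.

    Hypothetical helper; the heuristics are assumptions for illustration.
    """
    if isinstance(source, (list, tuple)):
        # In-memory transcripts contain CHAT headers; otherwise assume paths.
        if all(isinstance(s, str) and "@Begin" in s for s in source):
            return "from_strs"
        return "from_files"

    text = str(source)
    if text.startswith(("http://", "https://")):
        # Assumption: a remote .zip is an archive, anything else a Git repo.
        return "from_url" if text.endswith(".zip") else "from_git"
    if text.endswith(".zip"):
        return "from_zip"
    if Path(text).suffix == ".cha":
        return "from_files"
    return "from_dir"
```

The returned name can then be looked up on the CHAT class; the arguments each method accepts should be checked against the PyCantonese and Rustling documentation.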
The CHAT parser is powered by Rustling, with Cantonese-specific additions for Jyutping romanization, Chinese characters, and a general corpus search function.
If your data is in one of the
formats supported by Rustling,
you can use Rustling to parse it, apply any processing you need,
and create a CHAT object via from_strs().
For an example of this workflow, see how the CantoMap ELAN data is
converted for use in PyCantonese.
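The last step of that workflow, turning processed records into CHAT text, can be sketched as follows. The function to_chat() and its input shape are hypothetical, the header set is a bare minimum that real data may need to extend (for instance with @ID lines), and writing annotations to %mor as pos|jyutping is an assumption modeled on the annotations described above rather than a documented requirement.

```python
def to_chat(utterances, language="yue"):
    """Render records as one CHAT-formatted string.

    `utterances` is a list of (speaker, tokens) pairs, where tokens is a list
    of (word, pos, jyutping) triples. Hypothetical helper for illustration.
    """
    speakers = sorted({speaker for speaker, _ in utterances})
    lines = [
        "@UTF8",
        "@Begin",
        f"@Languages:\t{language}",
        # Assumption: every speaker is given the generic role "Adult".
        "@Participants:\t" + ", ".join(f"{s} Adult" for s in speakers),
    ]
    for speaker, tokens in utterances:
        words = " ".join(word for word, _, _ in tokens)
        # Assumption: annotations are written as pos|jyutping on %mor.
        mor = " ".join(f"{pos}|{jp}" for _, pos, jp in tokens)
        lines.append(f"*{speaker}:\t{words} .")
        lines.append(f"%mor:\t{mor} .")
    lines.append("@End")
    return "\n".join(lines) + "\n"


chat_str = to_chat(
    [("XXA", [("我", "r", "ngo5"), ("食", "v", "sik6"), ("飯", "n", "faan6")])]
)
```

A list of such strings, one per transcript, is the kind of in-memory input that from_strs() is described as accepting; its exact signature should be confirmed in the library documentation.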