Corpus Data

CHAT Format

For a corpus to be useful for modeling work beyond search queries, its source data has to be available in a machine-readable format. For Cantonese, several corpora that meet this criterion come from CHILDES and TalkBank, thanks to research on Cantonese language acquisition in recent years. More generally, because Cantonese is primarily a spoken language, many of its corpora are transcriptions of naturalistic speech. For these reasons, PyCantonese adopts the CHAT corpus format from CHILDES and TalkBank. CHAT is widely used, well documented, and supports rich linguistic annotation. PyCantonese uses the Rustling library to parse CHAT data files. For a primer on the CHAT data format, please see here.

Built-in Data

Currently, PyCantonese comes with one built-in corpus, the Hong Kong Cantonese Corpus (HKCanCor; license: CC BY), via the function hkcancor():

import pycantonese
hkcancor = pycantonese.hkcancor()
hkcancor.n_files  # number of data files
# 58
len(hkcancor.words())  # number of words as segmented from all the utterances
# 153656

HKCanCor is word-segmented and annotated for both Jyutping romanization and part-of-speech tags.

The original HKCanCor source files are in an XML format. They have been converted to CHAT for incorporation into PyCantonese. For details on the format conversion, please consult this readme.

CHILDES and TalkBank Data

For corpora other than HKCanCor, PyCantonese provides the function read_chat() to read in Cantonese data in the CHAT format.

As of 2026, CHAT datasets are publicly available from TalkBank. If you visit the webpage of a specific dataset, you’ll have to log in (account setup is free) before you can download the full transcripts as a ZIP archive to your local drive.
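Before handing a downloaded archive to read_chat(), it can be useful to check which transcripts it contains. The following is a minimal sketch using only the Python standard library. The helper list_chat_files() and the archive demo_dataset.zip are hypothetical names for illustration, and the sketch assumes transcripts carry the usual .cha file extension.

```python
import tempfile
import zipfile
from pathlib import Path


def list_chat_files(zip_path):
    """Return the sorted names of the .cha members inside a ZIP archive.

    Hypothetical helper for illustration; not part of PyCantonese.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".cha")
        )


# Build a stand-in archive so the sketch runs without a real download.
# The member names below are made up and do not mirror any actual dataset.
demo_zip = Path(tempfile.mkdtemp()) / "demo_dataset.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("demo_dataset/child1/010203.cha", "@UTF8\n@Begin\n@End\n")
    archive.writestr("demo_dataset/README.txt", "placeholder")

chat_members = list_chat_files(demo_zip)
print(chat_members)
```

With a real download, the path to the ZIP archive on your local drive would replace demo_zip.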

Note

All publicly available TalkBank datasets are associated with the CC BY-NC-SA 3.0 license.

Here are the Cantonese-related TalkBank datasets (in alphabetical order):

Custom Data

If you have your own CHAT data locally and would like PyCantonese to handle it, read_chat() takes a path that can be a ZIP archive, a local directory, or a single CHAT file.
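The three kinds of path described above can be sketched as follows. The transcript content, the directory demo_corpus, and the helpers make_sample_inputs() and read_each() are all hypothetical and serve only as illustration; the only PyCantonese call used is read_chat(), and it is wrapped in a function that is not called here, so the sketch runs even without PyCantonese installed.

```python
import tempfile
import zipfile
from pathlib import Path

# A tiny, made-up CHAT transcript for illustration only.
SAMPLE_CHAT = (
    "@UTF8\n"
    "@Begin\n"
    "@Languages:\tyue\n"
    "@Participants:\tCHI Target_Child\n"
    "@ID:\tyue|demo|CHI|||||Target_Child|||\n"
    "*CHI:\t你好 .\n"
    "@End\n"
)


def make_sample_inputs(root):
    """Create a single CHAT file, its parent directory, and a ZIP of both."""
    data_dir = root / "demo_corpus"
    data_dir.mkdir(parents=True, exist_ok=True)
    chat_file = data_dir / "sample.cha"
    chat_file.write_text(SAMPLE_CHAT, encoding="utf-8")
    zip_path = root / "demo_corpus.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.write(chat_file, arcname="demo_corpus/sample.cha")
    return {"file": chat_file, "directory": data_dir, "zip": zip_path}


def read_each(inputs):
    """Pass each kind of path to read_chat(); requires PyCantonese."""
    import pycantonese  # imported lazily so the rest runs without it

    return {kind: pycantonese.read_chat(str(path)) for kind, path in inputs.items()}


inputs = make_sample_inputs(Path(tempfile.mkdtemp()))
for kind, path in inputs.items():
    print(kind, path.name)
```

With PyCantonese installed, read_each(inputs) would attempt to read each of the three inputs. Whether this minimal transcript satisfies the CHAT parser is an assumption that has not been verified here.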

If more fine-grained control is needed when reading data, please check out CHAT, particularly the following methods:

The CHAT parser comes from Rustling, which both PyCantonese and PyLangAcq use. For more on reading CHAT data in general, please see PyLangAcq’s documentation.