API Reference

Corpus Data

`read_chat`(path, *[, filter_files, ...])	Read Cantonese CHAT data files.
`hkcancor`()	Create a corpus object for the Hong Kong Cantonese Corpus.
`CHAT`([chat])	A reader for Cantonese CHAT corpus data.
`CHAT.search`(*[, onset, nucleus, coda, tone, ...])	Search the data for the given criteria.

Jyutping Romanization

`characters_to_jyutping`(chars)	Convert Cantonese characters into Jyutping romanization.
`parse_jyutping`(jp_str)	Parse Jyutping romanization into onset, nucleus, coda, and tone.
`jyutping_to_ipa`(jp_str[, return_as, onsets, ...])	Convert Jyutping romanization into IPA.
`jyutping_to_yale`(jp_str[, return_as])	Convert Jyutping romanization into Yale romanization.
`jyutping_to_tipa`(jp_str)	Convert Jyutping romanization into LaTeX TIPA.

Natural Language Processing

`stop_words`([add, remove])	Return Cantonese stop words.
`parse_text`(data, *[, pos_tag_kwargs, ...])	Parse raw Cantonese text.
`segment`(unsegmented)	Segment the unsegmented input.
`pos_tag`(words[, tagset])	Tag the words for their parts of speech.
`pos_tagging.hkcancor_to_ud`([tag])	Map a part-of-speech tag from HKCanCor to Universal Dependencies.

`CHAT`

class pycantonese.CHAT(chat: Chat | None = None)

A reader for Cantonese CHAT corpus data.

This class wraps a Rust-backed CHAT parser and provides Cantonese-specific functionality such as Jyutping extraction, character-level access, and corpus search.

Attributes:

file_paths: The file paths.
n_files: The number of files.

Methods

`ages`()	Return the ages.
`append`(other)	Append another CHAT object's data.
`characters`(*[, by_utterance, by_file])	Return the data in individual Chinese characters.
`extend`(others)	Extend with data from multiple CHAT objects.
`filter`(*[, participants, files])	Filter the data by participants and/or files.
`from_dir`(path, *[, match, extension, ...])	Read CHAT data from a directory.
`from_files`(paths, *[, parallel, strict])	Read CHAT data from file paths.
`from_strs`(strs, *[, ids, parallel, strict])	Read CHAT data from strings.
`from_utterances`(utterances)	Construct a CHAT reader from a list of utterances.
`from_zip`(path, *[, match, extension, ...])	Read CHAT data from a ZIP file.
`head`([n])	Return the first n utterances with a formatted display.
`headers`()	Return the headers.
`info`([verbose])	Print summary information.
`jyutping`(*[, by_utterance, by_file])	Return the data in Jyutping romanization.
`languages`(*[, by_file])	Return the languages.
`participants`(*[, by_file])	Return the participants.
`search`(*[, onset, nucleus, coda, tone, ...])	Search the data for the given criteria.
`tail`([n])	Return the last n utterances with a formatted display.
`to_chat`(path, *[, is_dir, filenames])	Write the data to CHAT file(s).
`to_strs`()	Return the data as CHAT-formatted strings.
`tokens`(*[, by_utterance, by_file])	Return the tokens.
`utterances`(*[, by_file])	Return the utterances.
`word_ngrams`(n)	Return word n-grams across all utterances.
`words`(*[, by_utterance, by_file])	Return the words.

ages(): Return the ages.

append(other): Append another CHAT object's data.

characters(*, by_utterance=False, by_file=False) → list[str] | list[list[str]] | list[list[list[str]]]

Return the data in individual Chinese characters.

Parameters:

by_utterance (bool, optional): If True, return characters grouped by utterance.
by_file (bool, optional): If True, return characters grouped by file.

Returns:

list
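The three return shapes can be pictured with plain nested lists. The sketch below does not call pycantonese: the sample data and the `flatten` helper are hypothetical, and it assumes that setting both flags nests files, then utterances, then characters.

```python
from itertools import chain

# Hypothetical data: two files, each a list of utterances,
# each utterance a list of characters.
by_file_and_utterance = [
    [["你", "好"], ["早", "晨"]],  # file 1
    [["多", "謝"]],                # file 2
]


def flatten(nested):
    """Remove one level of nesting."""
    return list(chain.from_iterable(nested))


# Grouped by file only: one flat list of characters per file.
by_file = [flatten(file_) for file_ in by_file_and_utterance]

# Grouped by utterance only: file boundaries are dropped.
by_utterance = flatten(by_file_and_utterance)

# Neither flag set: a single flat list of characters.
flat = flatten(by_utterance)
```

The same nesting pattern would apply to `jyutping`, `tokens`, and `words`, which take the same two flags.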

extend(others): Extend with data from multiple CHAT objects.

property file_paths: The file paths.

filter(*, participants=None, files=None)

Filter the data by participants and/or files.

Parameters:

participants (str, optional): Regex pattern to match participant codes.
files (str, optional): Glob pattern to match file paths.

Returns:

CHAT
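The two parameters use different pattern languages: a regex for participant codes and a glob for file paths. The sketch below shows that difference with the standard library only. The records, the `filter_records` helper, and the choice of full-match regex semantics are assumptions for illustration, not the library's implementation.

```python
import re
from fnmatch import fnmatch

# Hypothetical (file path, participant code) pairs standing in for utterances.
records = [
    ("corpus/child/a01.cha", "CHI"),
    ("corpus/child/a01.cha", "MOT"),
    ("corpus/adult/b01.cha", "XXA"),
]


def filter_records(records, participants=None, files=None):
    """Keep records whose code matches a regex and whose path matches a glob."""
    kept = []
    for path, code in records:
        if participants is not None and not re.fullmatch(participants, code):
            continue
        if files is not None and not fnmatch(path, files):
            continue
        kept.append((path, code))
    return kept


child_side = filter_records(records, participants="CHI|MOT")
child_files = filter_records(records, files="*/child/*")
adult_only = filter_records(records, participants="XX.", files="*/adult/*")
```

When both criteria are given, this sketch keeps only records that satisfy both.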

classmethod from_dir(path: str | PathLike[str], *, match: str | None = None, extension='.cha', parallel=True, strict=True)

Read CHAT data from a directory.

Parameters:

path (str or os.PathLike[str]): Path to the directory.
match (str, optional): Glob pattern to match filenames within the directory.
extension (str, optional): File extension to match. Default is ".cha".
parallel (bool, optional): If True, parse files in parallel.
strict (bool, optional): If True, enforce strict parsing.

Returns:

CHAT
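One way to picture how `match` and `extension` narrow down the files in a directory is the standard-library sketch below. The `select_files` helper is hypothetical, and both the recursive walk and the rule that `match` is tested against the bare filename are assumptions, not confirmed behavior of `from_dir`.

```python
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def select_files(directory, match=None, extension=".cha"):
    """Return sorted paths under `directory` with the extension and filename glob."""
    selected = []
    for path in Path(directory).rglob("*"):
        if not path.is_file() or path.suffix != extension:
            continue
        if match is not None and not fnmatch(path.name, match):
            continue
        selected.append(path)
    return sorted(selected)


# Build a throwaway directory with three CHAT files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a01.cha", "a02.cha", "b01.cha", "notes.txt"):
        (Path(tmp) / name).write_text("@UTF8\n", encoding="utf-8")
    all_chat = [p.name for p in select_files(tmp)]
    only_a = [p.name for p in select_files(tmp, match="a*")]
```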

classmethod from_files(paths: Sequence[str | PathLike[str]], *, parallel=True, strict=True)

Read CHAT data from file paths.

Parameters:

paths (Sequence[str | os.PathLike[str]]): Paths to CHAT files.
parallel (bool, optional): If True, parse files in parallel.
strict (bool, optional): If True, enforce strict parsing.

Returns:

CHAT

classmethod from_strs(strs, *, ids=None, parallel=True, strict=True)

Read CHAT data from strings.

Parameters:

strs (list[str]): CHAT-formatted strings.
ids (list[str], optional): Identifiers for each string.
parallel (bool, optional): If True, parse strings in parallel.
strict (bool, optional): If True, enforce strict parsing.

Returns:

CHAT

classmethod from_utterances(utterances)

Construct a CHAT reader from a list of utterances.

Creates a new reader containing a single virtual file with the given utterances. Useful for splitting a reader into sub-readers based on utterance boundaries.

Parameters:

utterances (Sequence[Utterance]): Utterance objects to include.

Returns:

CHAT

classmethod from_zip(path: str | PathLike[str], *, match: str | None = None, extension='.cha', parallel=True, strict=True)

Read CHAT data from a ZIP file.

Parameters:

path (str or os.PathLike[str]): Path to the ZIP file.
match (str, optional): Glob pattern to match filenames within the ZIP.
extension (str, optional): File extension to match. Default is ".cha".
parallel (bool, optional): If True, parse files in parallel.
strict (bool, optional): If True, enforce strict parsing.

Returns:

CHAT

head(n=5): Return the first n utterances with a formatted display.

headers(): Return the headers.

info(verbose=False): Print summary information.

jyutping(*, by_utterance=False, by_file=False)

Return the data in Jyutping romanization.

Parameters:

by_utterance (bool, optional): If True, return Jyutping grouped by utterance.
by_file (bool, optional): If True, return Jyutping grouped by file.

Returns:

list

languages(*, by_file=False): Return the languages.

property n_files: The number of files.

participants(*, by_file=False): Return the participants.

search(*, onset=None, nucleus=None, coda=None, tone=None, initial=None, final=None, jyutping=None, character=None, pos=None, word_range=(0, 0), utterance_range=(0, 0), by_token=True, by_utterance=False, by_file=False)

Search the data for the given criteria.

Parameters:

onset (str, optional): Onset to search for. A regex is supported.
nucleus (str, optional): Nucleus to search for. A regex is supported.
coda (str, optional): Coda to search for. A regex is supported.
tone (str, optional): Tone to search for. A regex is supported.
initial (str, optional): Initial to search for. A regex is supported.
final (str, optional): Final to search for.
jyutping (str, optional): Jyutping romanization of one Cantonese character to search for.
character (str, optional): One or more Cantonese characters to search for.
pos (str, optional): A part-of-speech tag to search for. A regex is supported.
word_range (tuple[int, int], optional): Span of words around a match. Default is (0, 0).
utterance_range (tuple[int, int], optional): Span of utterances around a match. Default is (0, 0).
by_token (bool, optional): If True, return Token objects. Otherwise return word strings.
by_utterance (bool, optional): If True, return full utterances containing matches.
by_file (bool, optional): If True, return data organized by file.

Returns:

list
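The regex-based criteria can be illustrated without the library. In the sketch below, `Syllable` and `search_syllables` are hypothetical stand-ins. It assumes that each pattern must match its whole field and that multiple criteria combine with a logical AND, neither of which is confirmed by this reference.

```python
import re
from typing import NamedTuple


class Syllable(NamedTuple):
    """Hypothetical stand-in for a parsed Jyutping syllable."""

    onset: str
    nucleus: str
    coda: str
    tone: str


syllables = [
    Syllable("gw", "o", "ng", "2"),  # gwong2
    Syllable("d", "u", "ng", "1"),   # dung1
    Syllable("w", "aa", "", "2"),    # waa2
]


def search_syllables(syllables, *, onset=None, nucleus=None, coda=None, tone=None):
    """Return syllables for which every given regex fully matches its field."""
    criteria = {"onset": onset, "nucleus": nucleus, "coda": coda, "tone": tone}
    hits = []
    for syl in syllables:
        if all(
            re.fullmatch(pattern, getattr(syl, field))
            for field, pattern in criteria.items()
            if pattern is not None
        ):
            hits.append(syl)
    return hits


tone2 = search_syllables(syllables, tone="2")
ng_tone2 = search_syllables(syllables, coda="ng", tone="2")
stop_onsets = search_syllables(syllables, onset="[bdg]w?")
```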

tail(n=5): Return the last n utterances with a formatted display.

to_chat(path: str | PathLike[str], *, is_dir=False, filenames=None)

Write the data to CHAT file(s).

Parameters:

path (str or os.PathLike[str]): Output path.
is_dir (bool, optional): If True, write each file to a directory.
filenames (list[str], optional): Filenames for each file.

to_strs()

Return the data as CHAT-formatted strings.

Returns:

list[str]

tokens(*, by_utterance=False, by_file=False) → list[Token] | list[list[Token]] | list[list[list[Token]]]

Return the tokens.

Parameters:

by_utterance (bool, optional): If True, return tokens grouped by utterance.
by_file (bool, optional): If True, return tokens grouped by file.

Returns:

list

utterances(*, by_file=False) → list[Utterance] | list[list[Utterance]]

Return the utterances.

Parameters:

by_file (bool, optional): If True, return utterances grouped by file.

Returns:

list[Utterance] | list[list[Utterance]]

word_ngrams(n: int)

Return word n-grams across all utterances.

N-grams do not cross utterance boundaries.

Parameters:

n (int): The n-gram order (1 for unigrams, 2 for bigrams, etc.).

Returns:

Ngrams
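The boundary rule can be seen in a small standard-library sketch. It uses `collections.Counter` rather than the `Ngrams` class, and the utterances and the `count_ngrams` helper are hypothetical.

```python
from collections import Counter

# Hypothetical segmented utterances.
utterances = [
    ["我", "食", "飯"],
    ["我", "食", "麵"],
]


def count_ngrams(utterances, n):
    """Count n-grams within each utterance, never across two utterances."""
    counts = Counter()
    for words in utterances:
        # The window slides inside one utterance only.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i : i + n])] += 1
    return counts


bigrams = count_ngrams(utterances, 2)
```

Here `("飯", "我")` is never counted, because 飯 ends the first utterance and 我 begins the second.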

words(*, by_utterance=False, by_file=False) → list[str] | list[list[str]] | list[list[list[str]]]

Return the words.

Parameters:

by_utterance (bool, optional): If True, return words grouped by utterance.
by_file (bool, optional): If True, return words grouped by file.

Returns:

list

`Token`

class pycantonese.corpus.Token(word, pos=None, jyutping=None, mor=None, gloss=None, gra=None)

A token with Cantonese-specific fields parsed from a CHAT utterance.

Attributes:

gloss
gra
jyutping
mor
pos
word

Methods

to_gra_tier
to_mor_tier

`Jyutping`

class pycantonese.jyutping.Jyutping(onset: str, nucleus: str, coda: str, tone: str)

Jyutping representation of a Chinese/Cantonese character.

Attributes:

onset (str): Onset
nucleus (str): Nucleus
coda (str): Coda
tone (str): Tone

__eq__(other): Return self==value.

__hash__(): Return hash(self).

__init__(onset: str, nucleus: str, coda: str, tone: str) → None

__repr__(): Return repr(self).

__str__(): Combine onset + nucleus + coda + tone.

property final: Return the final (= nucleus + coda).
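The relationship between the four fields, the string form, and `final` can be sketched with a small dataclass. `JyutpingSketch` is an illustrative stand-in written for this example, not the library's own class. It only mirrors the behavior documented above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JyutpingSketch:
    """Illustrative stand-in for the documented Jyutping class."""

    onset: str
    nucleus: str
    coda: str
    tone: str

    def __str__(self) -> str:
        # Combine onset + nucleus + coda + tone.
        return self.onset + self.nucleus + self.coda + self.tone

    @property
    def final(self) -> str:
        # The final is the nucleus plus the coda.
        return self.nucleus + self.coda


gwong2 = JyutpingSketch(onset="gw", nucleus="o", coda="ng", tone="2")
syllable_text = str(gwong2)
syllable_final = gwong2.final
```

A frozen dataclass also provides value-based `__eq__` and `__hash__`, so two instances with the same four fields compare equal and can serve as dictionary keys.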

`Headers`

class rustling.chat.Headers

All file-level (non-changeable) headers from a CHAT file.

Attributes:

comments
date
languages
location
media
number
options
other
participants
pid
recording_quality
room_layout
situation
tape_location
time_duration
time_start
transcriber
transcription
types
videos
warning

`Ngrams`

class rustling.ngram.Ngrams(n, *, min_n=None)

A counter for n-grams, exposed to Python as Ngrams.

Attributes:

min_n: The minimum n-gram order.
n: The n-gram order.

Methods

`clear`()	Clear all counts.
`count`(seq)	Count n-grams from a single sequence.
`count_seqs`(seqs)	Count n-grams from multiple sequences.
`get`(ngram)	Return the count for a specific n-gram.
`items`(*[, order])	Return all (n-gram, count) pairs.
`most_common`([n, order])	Return the n most common n-grams with their counts.
`to_counter`(*[, order])	Convert to a Python `collections.Counter`.
`total`(*[, order])	Return the total number of n-gram tokens counted.
