API Reference

Corpus Data

read_chat(path, *[, filter_files, ...])

Read Cantonese CHAT data files.

hkcancor()

Create a corpus object for the Hong Kong Cantonese Corpus.

CHAT([chat])

A reader for Cantonese CHAT corpus data.

CHAT.search(*[, onset, nucleus, coda, tone, ...])

Search the data for the given criteria.

Jyutping Romanization

characters_to_jyutping(chars)

Convert Cantonese characters into Jyutping romanization.

parse_jyutping(jp_str)

Parse Jyutping romanization into onset, nucleus, coda, and tone.

jyutping_to_ipa(jp_str[, return_as, onsets, ...])

Convert Jyutping romanization into IPA.

jyutping_to_yale(jp_str[, return_as])

Convert Jyutping romanization into Yale romanization.

jyutping_to_tipa(jp_str)

Convert Jyutping romanization into LaTeX TIPA.

Natural Language Processing

stop_words([add, remove])

Return Cantonese stop words.

parse_text(data, *[, pos_tag_kwargs, ...])

Parse raw Cantonese text.

segment(unsegmented)

Segment the unsegmented input.

pos_tag(words[, tagset])

Tag the words for their parts of speech.

pos_tagging.hkcancor_to_ud([tag])

Map a part-of-speech tag from HKCanCor to Universal Dependencies.

CHAT

class pycantonese.CHAT(chat: Chat | None = None)

A reader for Cantonese CHAT corpus data.

This class wraps a Rust-backed CHAT parser and provides Cantonese-specific functionality such as Jyutping extraction, character-level access, and corpus search.

Attributes:
file_paths

The file paths.

n_files

The number of files.

Methods

ages()

Return the ages.

append(other)

Append another CHAT object's data.

characters(*[, by_utterance, by_file])

Return the data in individual Chinese characters.

extend(others)

Extend with data from multiple CHAT objects.

filter(*[, participants, files])

Filter the data by participants and/or files.

from_dir(path, *[, match, extension, ...])

Read CHAT data from a directory.

from_files(paths, *[, parallel, strict])

Read CHAT data from file paths.

from_strs(strs, *[, ids, parallel, strict])

Read CHAT data from strings.

from_utterances(utterances)

Construct a CHAT reader from a list of utterances.

from_zip(path, *[, match, extension, ...])

Read CHAT data from a ZIP file.

head([n])

Return the first n utterances with a formatted display.

headers()

Return the headers.

info([verbose])

Print summary information.

jyutping(*[, by_utterance, by_file])

Return the data in Jyutping romanization.

languages(*[, by_file])

Return the languages.

participants(*[, by_file])

Return the participants.

search(*[, onset, nucleus, coda, tone, ...])

Search the data for the given criteria.

tail([n])

Return the last n utterances with a formatted display.

to_chat(path, *[, is_dir, filenames])

Write the data to CHAT file(s).

to_strs()

Return the data as CHAT-formatted strings.

tokens(*[, by_utterance, by_file])

Return the tokens.

utterances(*[, by_file])

Return the utterances.

word_ngrams(n)

Return word n-grams across all utterances.

words(*[, by_utterance, by_file])

Return the words.

ages()

Return the ages.

append(other)

Append another CHAT object’s data.

characters(*, by_utterance=False, by_file=False) -> list[str] | list[list[str]] | list[list[list[str]]]

Return the data in individual Chinese characters.

Parameters:
by_utterance : bool, optional

If True, return characters grouped by utterance.

by_file : bool, optional

If True, return characters grouped by file.

Returns:
list
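
The return annotation suggests how the two flags control nesting. The sketch below is a plain-Python stand-in rather than the library's implementation: the `FILES` data and the `characters_sketch` function are hypothetical, and the file-then-utterance nesting order is an assumption inferred from the three-level return type.

```python
from itertools import chain

# Hypothetical stand-in data: two "files", each a list of utterances,
# each utterance a list of words.
FILES = [
    [["你好", "呀"], ["食", "飯"]],
    [["早晨"]],
]


def characters_sketch(files, *, by_utterance=False, by_file=False):
    """Split words into characters, nesting the result according to the flags."""
    # Start from the fully nested form: file -> utterance -> character.
    nested = [
        [[char for word in utterance for char in word] for utterance in file_]
        for file_ in files
    ]
    if not by_utterance:
        # Drop the utterance level inside each file.
        nested = [list(chain.from_iterable(file_)) for file_ in nested]
    if by_file:
        return nested
    # Drop the file level.
    return list(chain.from_iterable(nested))


flat = characters_sketch(FILES)
per_utterance = characters_sketch(FILES, by_utterance=True)
per_file = characters_sketch(FILES, by_file=True)
fully_nested = characters_sketch(FILES, by_utterance=True, by_file=True)
```

Under these assumptions, setting neither flag gives a flat list, setting one flag gives a list of lists, and setting both gives lists of utterances inside lists of files.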

extend(others)

Extend with data from multiple CHAT objects.

property file_paths

The file paths.

filter(*, participants=None, files=None)

Filter the data by participants and/or files.

Parameters:
participants : str, optional

Regex pattern to match participant codes.

files : str, optional

Glob pattern to match file paths.

Returns:
CHAT
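
The two parameters use different pattern languages: a regular expression for participant codes and a glob for file paths. The sketch below illustrates that difference with the standard library only. `ROWS` and `filter_sketch` are hypothetical, and the exact matching rules (a full match on the participant code, `fnmatch`-style globbing, and both criteria having to hold) are assumptions, not confirmed library behavior.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical stand-in data: (file path, participant code, utterance text).
ROWS = [
    ("corpus/child_01.cha", "CHI", "你好"),
    ("corpus/child_01.cha", "MOT", "食飯"),
    ("corpus/adult_01.cha", "XXA", "早晨"),
]


def filter_sketch(rows, *, participants=None, files=None):
    """Keep rows whose participant code and file path satisfy the given patterns."""
    kept = []
    for path, code, text in rows:
        # Assumption: the regex must match the whole participant code.
        if participants is not None and not re.fullmatch(participants, code):
            continue
        # Assumption: the glob is applied to the full path, fnmatch-style.
        if files is not None and not fnmatchcase(path, files):
            continue
        kept.append((path, code, text))
    return kept


child_and_mother = filter_sketch(ROWS, participants="CHI|MOT")
adult_files = filter_sketch(ROWS, files="*adult*")
```

Note that `"CHI|MOT"` is regex alternation, while `"*adult*"` uses glob wildcards; swapping the two syntaxes would not select the same rows.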

classmethod from_dir(path: str | PathLike[str], *, match: str | None = None, extension='.cha', parallel=True, strict=True)

Read CHAT data from a directory.

Parameters:
path : str or os.PathLike[str]

Path to the directory.

match : str, optional

Glob pattern to match filenames within the directory.

extension : str, optional

File extension to match. Default is ".cha".

parallel : bool, optional

If True, parse files in parallel.

strict : bool, optional

If True, enforce strict parsing.

Returns:
CHAT

classmethod from_files(paths: Sequence[str | PathLike[str]], *, parallel=True, strict=True)

Read CHAT data from file paths.

Parameters:
paths : Sequence[str | os.PathLike[str]]

Paths to CHAT files.

parallel : bool, optional

If True, parse files in parallel.

strict : bool, optional

If True, enforce strict parsing.

Returns:
CHAT

classmethod from_strs(strs, *, ids=None, parallel=True, strict=True)

Read CHAT data from strings.

Parameters:
strs : list[str]

CHAT-formatted strings.

ids : list[str], optional

Identifiers for each string.

parallel : bool, optional

If True, parse strings in parallel.

strict : bool, optional

If True, enforce strict parsing.

Returns:
CHAT

classmethod from_utterances(utterances)

Construct a CHAT reader from a list of utterances.

Creates a new reader containing a single virtual file with the given utterances. Useful for splitting a reader into sub-readers based on utterance boundaries.

Parameters:
utterances : Sequence[Utterance]

Utterance objects to include.

Returns:
CHAT

classmethod from_zip(path: str | PathLike[str], *, match: str | None = None, extension='.cha', parallel=True, strict=True)

Read CHAT data from a ZIP file.

Parameters:
path : str or os.PathLike[str]

Path to the ZIP file.

match : str, optional

Glob pattern to match filenames within the ZIP.

extension : str, optional

File extension to match. Default is ".cha".

parallel : bool, optional

If True, parse files in parallel.

strict : bool, optional

If True, enforce strict parsing.

Returns:
CHAT
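
To make the roles of `match` and `extension` concrete, the sketch below selects member names from an in-memory ZIP archive with the standard library. It does not call the library: `build_demo_zip` and `select_members_sketch` are hypothetical helpers, and the selection rule (the name must end with `extension` and, if `match` is given, also satisfy the glob) is an assumption based on the parameter descriptions above.

```python
import io
import zipfile
from fnmatch import fnmatchcase


def build_demo_zip():
    """Create a small ZIP archive in memory with two CHAT files and one other file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in ("corpus/child_01.cha", "corpus/adult_01.cha", "corpus/README.txt"):
            archive.writestr(name, "@UTF8\n@Begin\n@End\n")
    return buffer.getvalue()


def select_members_sketch(zip_bytes, *, match=None, extension=".cha"):
    """Return the sorted member names that pass the extension and glob checks."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
    return sorted(
        name
        for name in names
        # Assumption: both the extension test and the glob test must pass.
        if name.endswith(extension) and (match is None or fnmatchcase(name, match))
    )


demo_zip = build_demo_zip()
all_chat_files = select_members_sketch(demo_zip)
child_only = select_members_sketch(demo_zip, match="*child*")
```

With the default extension, the `README.txt` member is skipped even when no `match` pattern is given.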

head(n=5)

Return the first n utterances with a formatted display.

headers()

Return the headers.

info(verbose=False)

Print summary information.

jyutping(*, by_utterance=False, by_file=False) -> list[str | None] | list[list[str | None]] | list[list[list[str | None]]]

Return the data in Jyutping romanization.

Parameters:
by_utterance : bool, optional

If True, return Jyutping grouped by utterance.

by_file : bool, optional

If True, return Jyutping grouped by file.

Returns:
list

languages(*, by_file=False)

Return the languages.

property n_files

The number of files.

participants(*, by_file=False)

Return the participants.

search(*, onset=None, nucleus=None, coda=None, tone=None, initial=None, final=None, jyutping=None, character=None, pos=None, word_range=(0, 0), utterance_range=(0, 0), by_token=True, by_utterance=False, by_file=False)

Search the data for the given criteria.

Parameters:
onset : str, optional

Onset to search for. A regex is supported.

nucleus : str, optional

Nucleus to search for. A regex is supported.

coda : str, optional

Coda to search for. A regex is supported.

tone : str, optional

Tone to search for. A regex is supported.

initial : str, optional

Initial to search for. A regex is supported.

final : str, optional

Final to search for.

jyutping : str, optional

Jyutping romanization of one Cantonese character to search for.

character : str, optional

One or more Cantonese characters to search for.

pos : str, optional

A part-of-speech tag to search for. A regex is supported.

word_range : tuple[int, int], optional

Span of words around a match. Default is (0, 0).

utterance_range : tuple[int, int], optional

Span of utterances around a match. Default is (0, 0).

by_token : bool, optional

If True, return Token objects. Otherwise return word strings.

by_utterance : bool, optional

If True, return full utterances containing matches.

by_file : bool, optional

If True, return data organized by file.

Returns:
list
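
The sketch below illustrates, in plain Python, how regex criteria on syllable components and a `word_range` window could combine. It is not the library's algorithm. `UTTERANCE` and `search_sketch` are hypothetical, each word is a single pre-parsed syllable for simplicity, each regex is assumed to match the whole component, and `word_range` is assumed to mean (words to the left, words to the right).

```python
import re

# Hypothetical stand-in data: (word, (onset, nucleus, coda, tone)).
UTTERANCE = [
    ("我", ("ng", "o", "", "5")),
    ("食", ("s", "i", "k", "6")),
    ("飯", ("f", "aa", "n", "6")),
    ("先", ("s", "i", "n", "1")),
]
FIELDS = ("onset", "nucleus", "coda", "tone")


def search_sketch(utterance, *, word_range=(0, 0), **criteria):
    """Return words whose components match every given regex criterion."""
    left, right = word_range
    results = []
    for index, (word, parts) in enumerate(utterance):
        values = dict(zip(FIELDS, parts))
        # Assumption: every criterion must fully match its component.
        matched = all(
            re.fullmatch(pattern, values[name]) for name, pattern in criteria.items()
        )
        if not matched:
            continue
        if (left, right) == (0, 0):
            results.append(word)
        else:
            # Assumption: the window is clipped at the utterance edges.
            window = utterance[max(0, index - left): index + right + 1]
            results.append([w for w, _ in window])
    return results


stop_codas = search_sketch(UTTERANCE, coda="[ptk]")
tone_six = search_sketch(UTTERANCE, tone="6")
with_context = search_sketch(UTTERANCE, coda="[ptk]", word_range=(1, 1))
```

In this sketch, combining several criteria narrows the result, and a nonzero `word_range` turns each hit into a short list of neighboring words.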

tail(n=5)

Return the last n utterances with a formatted display.

to_chat(path: str | PathLike[str], *, is_dir=False, filenames=None)

Write the data to CHAT file(s).

Parameters:
path : str or os.PathLike[str]

Output path.

is_dir : bool, optional

If True, write each file to a directory.

filenames : list[str], optional

Filenames for each file.

to_strs()

Return the data as CHAT-formatted strings.

Returns:
list[str]

tokens(*, by_utterance=False, by_file=False) -> list[Token] | list[list[Token]] | list[list[list[Token]]]

Return the tokens.

Parameters:
by_utterance : bool, optional

If True, return tokens grouped by utterance.

by_file : bool, optional

If True, return tokens grouped by file.

Returns:
list

utterances(*, by_file=False) -> list[Utterance] | list[list[Utterance]]

Return the utterances.

Parameters:
by_file : bool, optional

If True, return utterances grouped by file.

Returns:
list[Utterance] | list[list[Utterance]]

word_ngrams(n: int)

Return word n-grams across all utterances.

N-grams do not cross utterance boundaries.

Parameters:
n : int

The n-gram order (1 for unigrams, 2 for bigrams, etc.).

Returns:
Ngrams
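
The statement that n-grams do not cross utterance boundaries can be illustrated with a short standard-library sketch. `word_ngrams_sketch` is a hypothetical stand-in that returns a `collections.Counter` rather than an `Ngrams` object, and the sample utterances are made up for illustration.

```python
from collections import Counter


def word_ngrams_sketch(utterances, n):
    """Count word n-grams within each utterance, never across two utterances."""
    counts = Counter()
    for words in utterances:
        # Slide a window of length n inside this utterance only.
        for start in range(len(words) - n + 1):
            counts[tuple(words[start:start + n])] += 1
    return counts


# Hypothetical stand-in data: two utterances of segmented words.
utterances = [["我", "食", "飯"], ["食", "飯", "先"]]
bigram_counts = word_ngrams_sketch(utterances, 2)
```

The pair ("飯", "食") never appears in the result, even though "飯" ends the first utterance and "食" starts the second.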

words(*, by_utterance=False, by_file=False) -> list[str] | list[list[str]] | list[list[list[str]]]

Return the words.

Parameters:
by_utterance : bool, optional

If True, return words grouped by utterance.

by_file : bool, optional

If True, return words grouped by file.

Returns:
list

Token

class pycantonese.corpus.Token(word, pos=None, jyutping=None, mor=None, gloss=None, gra=None)

A token with Cantonese-specific fields parsed from a CHAT utterance.

Attributes:
gloss
gra
jyutping
mor
pos
word

Methods

to_gra_tier

to_mor_tier

Jyutping

class pycantonese.jyutping.Jyutping(onset: str, nucleus: str, coda: str, tone: str)

Jyutping representation of a Chinese/Cantonese character.

Attributes:
onset : str

Onset

nucleus : str

Nucleus

coda : str

Coda

tone : str

Tone

__eq__(other)

Return self==value.

__hash__()

Return hash(self).

__init__(onset: str, nucleus: str, coda: str, tone: str) -> None

__repr__()

Return repr(self).

__str__()

Combine onset + nucleus + coda + tone.

property final

Return the final (= nucleus + coda).
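
The relationship between the four components, the string form, and the final can be shown with a small stand-in class. `JyutpingSketch` is a hypothetical frozen dataclass written for illustration only; it mirrors the documented fields and the two documented combinations, and it is not the library's class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JyutpingSketch:
    """Hypothetical stand-in holding the four components of one syllable."""

    onset: str
    nucleus: str
    coda: str
    tone: str

    def __str__(self):
        # Combine onset + nucleus + coda + tone, as documented for __str__.
        return f"{self.onset}{self.nucleus}{self.coda}{self.tone}"

    @property
    def final(self):
        # The final is the nucleus followed by the coda.
        return self.nucleus + self.coda


# The syllable "gwong2" split into its components.
syllable = JyutpingSketch(onset="gw", nucleus="o", coda="ng", tone="2")
as_text = str(syllable)
final_part = syllable.final
```

A frozen dataclass also supplies equality and hashing based on the field values, which is one simple way to obtain the `__eq__` and `__hash__` behavior listed above.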

Headers

class rustling.chat.Headers

All file-level (non-changeable) headers from a CHAT file.

Attributes:
comments
date
languages
location
media
number
options
other
participants
pid
recording_quality
room_layout
situation
tape_location
time_duration
time_start
transcriber
transcription
types
videos
warning

Ngrams

class rustling.ngram.Ngrams(n, *, min_n=None)

A container of n-gram counts, exposed to Python as Ngrams.

Attributes:
min_n

The minimum n-gram order.

n

The n-gram order.

Methods

clear()

Clear all counts.

count(seq)

Count n-grams from a single sequence.

count_seqs(seqs)

Count n-grams from multiple sequences.

get(ngram)

Return the count for a specific n-gram.

items(*[, order])

Return all (n-gram, count) pairs.

most_common([n, order])

Return the n most common n-grams with their counts.

to_counter(*[, order])

Convert to a Python collections.Counter.

total(*[, order])

Return the total number of n-gram tokens counted.
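
The counting methods listed above resemble the interface of `collections.Counter`. The sketch below is a hypothetical single-order stand-in, `NgramsSketch`, that shows how `count_seqs`, `get`, `most_common`, and `total` could relate to one another. It ignores `min_n` and the `order` keyword, whose exact semantics are not described here, and it is not the Rust-backed implementation.

```python
from collections import Counter


class NgramsSketch:
    """Hypothetical single-order stand-in for an n-gram counter."""

    def __init__(self, n):
        self.n = n
        self._counts = Counter()

    def count(self, seq):
        """Count the n-grams of one sequence."""
        seq = list(seq)
        for start in range(len(seq) - self.n + 1):
            self._counts[tuple(seq[start:start + self.n])] += 1

    def count_seqs(self, seqs):
        """Count the n-grams of several sequences, one sequence at a time."""
        for seq in seqs:
            self.count(seq)

    def get(self, ngram):
        """Return the count of one n-gram, or 0 if it was never seen."""
        return self._counts[tuple(ngram)]

    def most_common(self, n=None):
        """Return (n-gram, count) pairs, most frequent first."""
        return self._counts.most_common(n)

    def total(self):
        """Return the number of n-gram tokens counted so far."""
        return sum(self._counts.values())

    def clear(self):
        """Forget all counts."""
        self._counts.clear()


bigrams = NgramsSketch(2)
bigrams.count_seqs([["我", "食", "飯"], ["我", "食", "麵"]])
top_bigram = bigrams.most_common(1)
token_total = bigrams.total()
```

Counting sequence by sequence keeps each sequence's n-grams separate, so no n-gram spans the end of one sequence and the start of the next.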