API Reference

Corpus Data

`read_chat`(path, *[, filter_files, ...])	Read Cantonese CHAT data files.
`hkcancor`()	Create a corpus object for the Hong Kong Cantonese Corpus.
`cantomap`()	Create a corpus object for the CantoMap corpus.
`CHAT`([chat])	A reader for Cantonese CHAT corpus data.
`CHAT.search`(*[, onset, nucleus, coda, tone, ...])	Search the data for the given criteria.

Jyutping Romanization

`characters_to_jyutping`(chars)	Convert Cantonese characters into Jyutping romanization.
`parse_jyutping`(jp_str)	Parse Jyutping romanization into onset, nucleus, coda, and tone.
`jyutping_to_ipa`(jp, *[, onsets, nuclei, ...])	Convert Jyutping romanization into IPA.
`jyutping_to_yale`(jp)	Convert Jyutping romanization into Yale romanization.
`stringify_yale`(yale)	Join Yale words (the output of `jyutping_to_yale()`) into one string.
`yale_to_jyutping`(yale)	Convert Yale romanization into Jyutping romanization.
`jyutping_to_tipa`(jp)	Convert Jyutping romanization into LaTeX TIPA.
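
To make the four components named for `parse_jyutping()` concrete, here is a minimal sketch that splits a single Jyutping syllable using only the standard library. It is not pycantonese's implementation: the function name `split_syllable` and both regular expressions are assumptions made for illustration, and the sketch handles one syllable at a time.

```python
import re

# Hypothetical pattern for one ordinary Jyutping syllable (illustration only).
_SYLLABLE = re.compile(
    r"^(gw|kw|ng|[bpmfdtnlgkhwzcsj])?"  # optional onset
    r"(aa|oe|eo|yu|[aeiou])"            # nucleus
    r"(ng|[ptkmniu])?"                  # optional coda
    r"([1-6])$"                         # tone
)
# Syllabic nasals such as "m4" and "ng5" have no vowel nucleus.
_SYLLABIC_NASAL = re.compile(r"^(m|ng)([1-6])$")


def split_syllable(syllable: str) -> tuple[str, str, str, str]:
    """Split one Jyutping syllable into (onset, nucleus, coda, tone)."""
    nasal = _SYLLABIC_NASAL.match(syllable)
    if nasal:
        nucleus, tone = nasal.groups()
        return ("", nucleus, "", tone)
    match = _SYLLABLE.match(syllable)
    if match is None:
        raise ValueError(f"not a recognizable Jyutping syllable: {syllable!r}")
    onset, nucleus, coda, tone = match.groups()
    # Unmatched optional groups come back as None; normalize to "".
    return (onset or "", nucleus, coda or "", tone)


print(split_syllable("gwong2"))
print(split_syllable("jyut6"))
```

Empty strings stand in for a missing onset or coda, so every result has exactly four string fields.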

Grapheme-to-Phoneme Conversion

`g2p`(chars, *[, onsets, nuclei, codas, tones])	Convert Cantonese characters into IPA (grapheme-to-phoneme).

Natural Language Processing

`stop_words`([add, remove])	Return Cantonese stop words.
`parse_text`(data, *[, pos_tag_kwargs, ...])	Parse raw Cantonese text.
`segment`(-> list[str])	Segment the unsegmented input.
`pos_tag`(words[, tagset])	Tag the words for their parts of speech.
`pos_tagging.hkcancor_to_ud`([tag])	Map a part-of-speech tag from HKCanCor to Universal Dependencies.

`CHAT`

class pycantonese.CHAT(chat: Chat | None = None)[source]

A reader for Cantonese CHAT corpus data.

This class wraps a Rust-backed CHAT parser and provides Cantonese-specific functionality such as Jyutping extraction, character-level access, and corpus search.

ages()[source]: Return the ages.

append(other)[source]: Append another CHAT object’s data.

characters(*, by_utterance=False, by_file=False) → list[str] | list[list[str]] | list[list[list[str]]][source]

Return the data in individual Chinese characters.

Parameters:

by_utterance (bool, optional) – If True, return characters grouped by utterance.
by_file (bool, optional) – If True, return characters grouped by file.

Returns:

list

extend(others)[source]: Extend with data from multiple CHAT objects.

property file_paths: The file paths.

filter(*, participants=None, files=None)[source]

Filter the data by participants and/or files.

Parameters:

participants (str, optional) – Regex pattern to match participant codes.
files (str, optional) – Glob pattern to match file paths.

Returns:

CHAT
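
The combination of a regex for participants and a glob for files can be illustrated with a small standalone sketch. Everything here is hypothetical: `filter_records` and the toy `records` list are not part of pycantonese, and the matching rules (whole-code regex match, glob tested against the full path, both conditions required when both are given) are assumptions rather than documented behavior.

```python
import re
from fnmatch import fnmatchcase

# Toy records standing in for utterances: (file path, participant code, words).
records = [
    ("corpus/a/001.cha", "CHI", ["食", "飯"]),
    ("corpus/a/001.cha", "MOT", ["好", "呀"]),
    ("corpus/b/002.cha", "CHI", ["去", "街"]),
]


def filter_records(records, *, participants=None, files=None):
    """Keep records that satisfy the participant regex and the file glob."""
    kept = []
    for path, participant, words in records:
        # Assumption: the regex must match the whole participant code.
        if participants is not None and not re.fullmatch(participants, participant):
            continue
        # Assumption: the glob is tested against the full file path.
        if files is not None and not fnmatchcase(path, files):
            continue
        kept.append((path, participant, words))
    return kept


print(filter_records(records, participants="CHI", files="*/a/*"))
```

Leaving either argument as None skips that test, which mirrors the "and/or" wording of the method description.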

classmethod from_dir(path: str | PathLike[str], *, match: str | None = None, extension='.cha', parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')[source]

Read CHAT data from a directory.

Parameters:

path (str or os.PathLike[str]) – Path to the directory.
match (str, optional) – Glob pattern to match filenames within the directory.
extension (str, optional) – File extension to match. Default is ".cha".
parallel (bool, optional) – If True, parse files in parallel.
strict (bool, optional) – If True, enforce strict parsing.
mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.
gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.

Returns:

CHAT

classmethod from_files(paths: Sequence[str | PathLike[str]], *, parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')[source]

Read CHAT data from file paths.

Parameters:

paths (Sequence[str | os.PathLike[str]]) – Paths to CHAT files.
parallel (bool, optional) – If True, parse files in parallel.
strict (bool, optional) – If True, enforce strict parsing.
mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.
gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.

Returns:

CHAT

classmethod from_git(url: str, *, rev: str | None = None, depth: int | None = None, match: str | None = None, extension='.cha', cache_dir: str | PathLike[str] | None = None, force_download=False, parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')[source]

Read CHAT data from a Git repository.

Parameters:

url (str) – URL of the Git repository.
rev (str, optional) – Git revision (branch, tag, or commit hash).
depth (int, optional) – Clone depth for shallow clones.
match (str, optional) – Glob pattern to match filenames within the repository.
extension (str, optional) – File extension to match. Default is ".cha".
cache_dir (str or os.PathLike[str], optional) – Directory to cache the cloned repository.
force_download (bool, optional) – If True, force re-download even if cached.
parallel (bool, optional) – If True, parse files in parallel.
strict (bool, optional) – If True, enforce strict parsing.
mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.
gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.

Returns:

CHAT

classmethod from_strs(strs, *, ids=None, parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')[source]

Read CHAT data from strings.

Parameters:

strs (list[str]) – CHAT-formatted strings.
ids (list[str], optional) – Identifiers for each string.
parallel (bool, optional) – If True, parse strings in parallel.
strict (bool, optional) – If True, enforce strict parsing.
mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.
gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.

Returns:

CHAT

classmethod from_url(url: str, *, match: str | None = None, extension='.cha', cache_dir: str | PathLike[str] | None = None, force_download=False, parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')[source]

Read CHAT data from a URL pointing to a ZIP archive.

Parameters:

url (str) – URL of the ZIP archive.
match (str, optional) – Glob pattern to match filenames within the archive.
extension (str, optional) – File extension to match. Default is ".cha".
cache_dir (str or os.PathLike[str], optional) – Directory to cache the downloaded archive.
force_download (bool, optional) – If True, force re-download even if cached.
parallel (bool, optional) – If True, parse files in parallel.
strict (bool, optional) – If True, enforce strict parsing.
mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.
gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.

Returns:

CHAT

classmethod from_utterances(utterances)[source]

Construct a CHAT reader from a list of utterances.

Creates a new reader containing a single virtual file with the given utterances. Useful for splitting a reader into sub-readers based on utterance boundaries.

Parameters: utterances (Sequence[Utterance]) – Utterance objects to include.
Returns: CHAT

classmethod from_zip(path: str | PathLike[str], *, match: str | None = None, extension='.cha', parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')[source]

Read CHAT data from a ZIP file.

Parameters:

path (str or os.PathLike[str]) – Path to the ZIP file.
match (str, optional) – Glob pattern to match filenames within the ZIP.
extension (str, optional) – File extension to match. Default is ".cha".
parallel (bool, optional) – If True, parse files in parallel.
strict (bool, optional) – If True, enforce strict parsing.
mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.
gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.

Returns:

CHAT

head(n=5)[source]: Return the first n utterances with a formatted display.

headers()[source]: Return the headers.

info(verbose=False)[source]: Print summary information.

jyutping(*, by_utterance=False, by_file=False)

Return the data in Jyutping romanization.

Parameters:

by_utterance (bool, optional) – If True, return Jyutping grouped by utterance.
by_file (bool, optional) – If True, return Jyutping grouped by file.

Returns:

list

languages(*, by_file=False)[source]: Return the languages.

property n_files: The number of files.

participants(*, by_file=False)[source]: Return the participants.

search(*, onset=None, nucleus=None, coda=None, tone=None, initial=None, final=None, jyutping=None, character=None, pos=None, word_range=(0, 0), utterance_range=(0, 0), by_token=True, by_utterance=False, by_file=False)[source]

Search the data for the given criteria.

Parameters:

onset (str, optional) – Onset to search for. A regex is supported.
nucleus (str, optional) – Nucleus to search for. A regex is supported.
coda (str, optional) – Coda to search for. A regex is supported.
tone (str, optional) – Tone to search for. A regex is supported.
initial (str, optional) – Initial to search for. A regex is supported.
final (str, optional) – Final to search for.
jyutping (str, optional) – Jyutping romanization of one Cantonese character to search for.
character (str, optional) – One or more Cantonese characters to search for.
pos (str, optional) – A part-of-speech tag to search for. A regex is supported.
word_range (tuple[int, int], optional) – Span of words around a match. Default is (0, 0).
utterance_range (tuple[int, int], optional) – Span of utterances around a match. Default is (0, 0).
by_token (bool, optional) – If True, return Token objects. Otherwise return word strings.
by_utterance (bool, optional) – If True, return full utterances containing matches.
by_file (bool, optional) – If True, return data organized by file.

Returns:

list
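
The idea of a word span around a match can be sketched without the library. This sketch assumes that `word_range` means (words before, words after) and that the span never leaves the matching utterance; neither point is confirmed by the description above. The helper `search_words` is hypothetical, works on plain word strings rather than Token objects, and supports only an exact word match.

```python
def search_words(utterances, target, word_range=(0, 0)):
    """Return each match of `target` with its surrounding words."""
    before, after = word_range
    results = []
    for words in utterances:
        for i, word in enumerate(words):
            if word == target:
                # Clip the window at the start of the utterance;
                # slicing already clips at the end.
                start = max(0, i - before)
                results.append(words[start : i + after + 1])
    return results


utterances = [["我", "睇", "書"], ["書", "好", "貴"]]
print(search_words(utterances, "書", word_range=(1, 1)))
```

With the default `(0, 0)` the window contains only the matching word itself.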

tail(n=5)[source]: Return the last n utterances with a formatted display.

to_files(dir_path: str | PathLike[str], *, filenames=None)[source]

Write CHAT (.cha) files to a directory.

Parameters:

dir_path (str or os.PathLike[str]) – Output directory path.
filenames (list[str], optional) – Filenames for each file.

to_strs()[source]

Return the data as CHAT-formatted strings.

Returns: list[str]

tokens(*, by_utterance=False, by_file=False) → list[Token] | list[list[Token]] | list[list[list[Token]]][source]

Return the tokens.

Parameters:

by_utterance (bool, optional) – If True, return tokens grouped by utterance.
by_file (bool, optional) – If True, return tokens grouped by file.

Returns:

list

utterances(*, by_file=False) → list[Utterance] | list[list[Utterance]][source]

Return the utterances.

Parameters: by_file (bool, optional) – If True, return utterances grouped by file.
Returns: list[Utterance] | list[list[Utterance]]

word_ngrams(n: int)[source]

Return word n-grams across all utterances.

N-grams do not cross utterance boundaries.

Parameters: n (int) – The n-gram order (1 for unigrams, 2 for bigrams, etc.).
Returns: Ngrams
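
The rule that n-grams stay inside one utterance can be shown with a short standalone sketch. It uses `collections.Counter` rather than the `Ngrams` class, and the function below is a hypothetical helper written for illustration, not the library's code.

```python
from collections import Counter


def count_word_ngrams(utterances, n):
    """Count word n-grams, never spanning two utterances."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts = Counter()
    for words in utterances:
        # Slide a window of width n inside this utterance only.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i : i + n])] += 1
    return counts


bigrams = count_word_ngrams([["我", "去", "街"], ["去", "街"]], 2)
print(bigrams.most_common())
```

Because each utterance is windowed separately, the last word of one utterance is never paired with the first word of the next, and utterances shorter than n contribute nothing.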

words(*, by_utterance=False, by_file=False) → list[str] | list[list[str]] | list[list[list[str]]][source]

Return the words.

Parameters:

by_utterance (bool, optional) – If True, return words grouped by utterance.
by_file (bool, optional) – If True, return words grouped by file.

Returns:

list
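
The three return shapes in the signature can be pictured with a toy nesting of files, utterances, and words. This is an assumed reading of how `by_utterance` and `by_file` combine, inferred from the return type alone; `corpus` and `flatten_words` are hypothetical names and do not come from pycantonese.

```python
# Toy corpus: files -> utterances -> words (assumed layout, for illustration).
corpus = [
    [["你", "好"], ["食", "飯"]],  # file 1
    [["早晨"]],                    # file 2
]


def flatten_words(corpus, *, by_utterance=False, by_file=False):
    """Flatten the file/utterance/word nesting according to the two flags."""
    if by_file and by_utterance:
        return [[list(u) for u in f] for f in corpus]       # list[list[list[str]]]
    if by_file:
        return [[w for u in f for w in u] for f in corpus]  # list[list[str]]
    if by_utterance:
        return [list(u) for f in corpus for u in f]         # list[list[str]]
    return [w for f in corpus for u in f for w in u]        # list[str]


print(flatten_words(corpus))
print(flatten_words(corpus, by_utterance=True))
```

The same flags appear on `characters()`, `jyutping()`, and `tokens()`, so the same mental model is likely to apply there, again as an assumption.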

`Token`

class pycantonese.corpus.Token(word, pos=None, jyutping=None, mor=None, gloss=None, gra=None): A token with Cantonese-specific fields parsed from a CHAT utterance.
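
To show where a token's `word`, `pos`, and `jyutping` fields might come from, the sketch below pairs a main tier with a `%mor` tier. It assumes an HKCanCor-style layout in which each `%mor` item has the form `pos|jyutping`; other corpora may encode the tier differently. `ToyToken` and `pair_tiers` are hypothetical stand-ins, not the library's `Token` class or parser.

```python
from typing import NamedTuple, Optional


class ToyToken(NamedTuple):
    """Hypothetical stand-in for a token; not the library's Token class."""

    word: str
    pos: Optional[str]
    jyutping: Optional[str]


def pair_tiers(main_tier: str, mor_tier: str) -> list[ToyToken]:
    """Pair each word on the main tier with its `pos|jyutping` item on %mor."""
    words = main_tier.split()
    items = mor_tier.split()
    if len(words) != len(items):
        raise ValueError("main tier and %mor tier have different lengths")
    tokens = []
    for word, item in zip(words, items):
        if "|" in item:
            pos, jyutping = item.split("|", 1)
            tokens.append(ToyToken(word, pos, jyutping))
        else:
            # Items without a "|" (e.g. punctuation) carry no annotation here.
            tokens.append(ToyToken(word, None, None))
    return tokens


tokens = pair_tiers("喂 遲 啲 去 ?", "e|wai3 a|ci4 u|di1 v|heoi3 ?")
for token in tokens:
    print(token)
```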

`Utterance`

class pycantonese.corpus.Utterance(*, participant, tokens, time_marks=None, tiers=None, audible=None, changeable_header=None, mor_tier_name=Ellipsis, gra_tier_name=Ellipsis): An utterance from CHAT data with preprocessed Cantonese tokens.

`Jyutping`

class pycantonese.jyutping.Jyutping(onset: str, nucleus: str, coda: str, tone: str)[source]

Jyutping representation of a Chinese/Cantonese character.

Attributes:

onset (str) – Onset.
nucleus (str) – Nucleus.
coda (str) – Coda.
tone (str) – Tone.

__eq__(other): Return self==value.

__hash__()[source]: Return hash(self).

__init__(onset: str, nucleus: str, coda: str, tone: str) → None

__repr__(): Return repr(self).

__str__()[source]: Combine onset + nucleus + coda + tone.

property final: Return the final (= nucleus + coda).
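
The behavior described for this class (four string fields, equality and hashing, a string form that concatenates the fields, and a `final` made of nucleus plus coda) can be mirrored in a few lines. The class below is an illustrative stand-in named `JyutpingSketch`, built on a frozen dataclass; it is not the library's own definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JyutpingSketch:
    """Illustrative stand-in, not pycantonese.jyutping.Jyutping itself."""

    onset: str
    nucleus: str
    coda: str
    tone: str

    def __str__(self) -> str:
        # Combine onset + nucleus + coda + tone.
        return self.onset + self.nucleus + self.coda + self.tone

    @property
    def final(self) -> str:
        # The final is the nucleus followed by the coda.
        return self.nucleus + self.coda


jp = JyutpingSketch(onset="gw", nucleus="o", coda="ng", tone="2")
print(str(jp), jp.final)
```

A frozen dataclass generates `__eq__` and `__hash__` from the fields, so two instances with the same components compare equal and can be used as set members or dictionary keys.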

`Headers`

class rustling.chat.Headers: All file-level (non-changeable) headers from a CHAT file.

`Ngrams`

class rustling.ngram.Ngrams(n, *, min_n=None): An n-gram container exposed to Python by the rustling library. This is the type returned by CHAT.word_ngrams().
