API Reference

Corpus Data

read_chat(path, *[, filter_files, ...])

Read Cantonese CHAT data files.

hkcancor()

Create a corpus object for the Hong Kong Cantonese Corpus.

cantomap()

Create a corpus object for the CantoMap corpus.

CHAT([chat])

A reader for Cantonese CHAT corpus data.

CHAT.search(*[, onset, nucleus, coda, tone, ...])

Search the data for the given criteria.

Jyutping Romanization

characters_to_jyutping(chars)

Convert Cantonese characters into Jyutping romanization.

parse_jyutping(jp_str)

Parse Jyutping romanization into onset, nucleus, coda, and tone.

jyutping_to_ipa(jp, *[, onsets, nuclei, ...])

Convert Jyutping romanization into IPA.

jyutping_to_yale(jp)

Convert Jyutping romanization into Yale romanization.

stringify_yale(yale)

Join Yale words (the output of jyutping_to_yale()) into one string.

yale_to_jyutping(yale)

Convert Yale romanization into Jyutping romanization.

jyutping_to_tipa(jp)

Convert Jyutping romanization into LaTeX TIPA.

Grapheme-to-Phoneme Conversion

g2p(chars, *[, onsets, nuclei, codas, tones])

Convert Cantonese characters into IPA (grapheme-to-phoneme).

Natural Language Processing

stop_words([add, remove])

Return Cantonese stop words.

parse_text(data, *[, pos_tag_kwargs, ...])

Parse raw Cantonese text.

segment(...) -> list[str]

Segment the unsegmented input.

pos_tag(words[, tagset])

Tag the words for their parts of speech.

pos_tagging.hkcancor_to_ud([tag])

Map a part-of-speech tag from HKCanCor to Universal Dependencies.

CHAT

class pycantonese.CHAT(chat: Chat | None = None)

A reader for Cantonese CHAT corpus data.

This class wraps a Rust-backed CHAT parser and provides Cantonese-specific functionality such as Jyutping extraction, character-level access, and corpus search.

ages()

Return the ages.

append(other)

Append another CHAT object’s data.

characters(*, by_utterance=False, by_file=False) -> list[str] | list[list[str]] | list[list[list[str]]]

Return the data in individual Chinese characters.

Parameters:
  • by_utterance (bool, optional) – If True, return characters grouped by utterance.

  • by_file (bool, optional) – If True, return characters grouped by file.

Returns:

list

extend(others)

Extend with data from multiple CHAT objects.

property file_paths

The file paths.

filter(*, participants=None, files=None)

Filter the data by participants and/or files.

Parameters:
  • participants (str, optional) – Regex pattern to match participant codes.

  • files (str, optional) – Glob pattern to match file paths.

Returns:

CHAT
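The difference between the two pattern types (a regex for participant codes, a glob for file paths) can be sketched with the standard library alone. This is an illustration under assumptions, not the library's implementation: `filter_records` and the `(path, participant)` records are hypothetical, and the text above does not say whether the real `participants` regex must match the whole code or only part of it (full matching is assumed here).

```python
import fnmatch
import re

# Hypothetical records: (file path, participant code) pairs.
records = [
    ("corpus/a/01.cha", "CHI"),
    ("corpus/a/01.cha", "MOT"),
    ("corpus/b/02.cha", "CHI"),
]


def filter_records(records, *, participants=None, files=None):
    """Keep records whose participant matches a regex and whose path matches a glob."""
    kept = []
    for path, code in records:
        # Assumption: the regex has to match the entire participant code.
        if participants is not None and not re.fullmatch(participants, code):
            continue
        # Glob matching on the path, as done by fnmatch.
        if files is not None and not fnmatch.fnmatch(path, files):
            continue
        kept.append((path, code))
    return kept


children = filter_records(records, participants="CHI")
folder_a = filter_records(records, participants="CHI|MOT", files="*/a/*")
```

Here `children` keeps the two `CHI` records, while `folder_a` keeps both participants but only from paths under `a/`.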

classmethod from_dir(path: str | PathLike[str], *, match: str | None = None, extension='.cha', parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')

Read CHAT data from a directory.

Parameters:
  • path (str or os.PathLike[str]) – Path to the directory.

  • match (str, optional) – Glob pattern to match filenames within the directory.

  • extension (str, optional) – File extension to match. Default is ".cha".

  • parallel (bool, optional) – If True, parse files in parallel.

  • strict (bool, optional) – If True, enforce strict parsing.

  • mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.

  • gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.

Returns:

CHAT
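How `extension` and `match` might jointly narrow down the files in a directory can be sketched as follows. This is a standard-library illustration only: `select_chat_files` is a hypothetical helper, and both the recursive directory walk and the choice to test `match` against the bare filename are assumptions, not documented behavior of `from_dir`.

```python
import fnmatch
import tempfile
from pathlib import Path


def select_chat_files(root, *, match=None, extension=".cha"):
    """Return relative paths under root that pass the extension and glob checks."""
    root = Path(root)
    selected = []
    for path in sorted(root.rglob("*")):  # assumption: search recursively
        if not path.is_file() or not path.name.endswith(extension):
            continue
        # Assumption: the glob is tested against the filename only.
        if match is not None and not fnmatch.fnmatch(path.name, match):
            continue
        selected.append(path.relative_to(root).as_posix())
    return selected


# Build a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("child_01.cha", "child_02.cha", "adult_01.cha", "notes.txt"):
        (Path(tmp) / name).write_text("@UTF8\n", encoding="utf-8")
    all_chat = select_chat_files(tmp)
    child_only = select_chat_files(tmp, match="child_*")
```

With these inputs, `all_chat` holds the three `.cha` files and `child_only` holds the two whose names start with `child_`.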

classmethod from_files(paths: Sequence[str | PathLike[str]], *, parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')

Read CHAT data from file paths.

Parameters:
  • paths (Sequence[str | os.PathLike[str]]) – Paths to CHAT files.

  • parallel (bool, optional) – If True, parse files in parallel.

  • strict (bool, optional) – If True, enforce strict parsing.

  • mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.

  • gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.

Returns:

CHAT

classmethod from_git(url: str, *, rev: str | None = None, depth: int | None = None, match: str | None = None, extension='.cha', cache_dir: str | PathLike[str] | None = None, force_download=False, parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')

Read CHAT data from a Git repository.

Parameters:
  • url (str) – URL of the Git repository.

  • rev (str, optional) – Git revision (branch, tag, or commit hash).

  • depth (int, optional) – Clone depth for shallow clones.

  • match (str, optional) – Glob pattern to match filenames within the repository.

  • extension (str, optional) – File extension to match. Default is ".cha".

  • cache_dir (str or os.PathLike[str], optional) – Directory to cache the cloned repository.

  • force_download (bool, optional) – If True, force re-download even if cached.

  • parallel (bool, optional) – If True, parse files in parallel.

  • strict (bool, optional) – If True, enforce strict parsing.

  • mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.

  • gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.

Returns:

CHAT

classmethod from_strs(strs, *, ids=None, parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')

Read CHAT data from strings.

Parameters:
  • strs (list[str]) – CHAT-formatted strings.

  • ids (list[str], optional) – Identifiers for each string.

  • parallel (bool, optional) – If True, parse strings in parallel.

  • strict (bool, optional) – If True, enforce strict parsing.

  • mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.

  • gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.

Returns:

CHAT

classmethod from_url(url: str, *, match: str | None = None, extension='.cha', cache_dir: str | PathLike[str] | None = None, force_download=False, parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')

Read CHAT data from a URL pointing to a ZIP archive.

Parameters:
  • url (str) – URL of the ZIP archive.

  • match (str, optional) – Glob pattern to match filenames within the archive.

  • extension (str, optional) – File extension to match. Default is ".cha".

  • cache_dir (str or os.PathLike[str], optional) – Directory to cache the downloaded archive.

  • force_download (bool, optional) – If True, force re-download even if cached.

  • parallel (bool, optional) – If True, parse files in parallel.

  • strict (bool, optional) – If True, enforce strict parsing.

  • mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.

  • gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.

Returns:

CHAT

classmethod from_utterances(utterances)

Construct a CHAT reader from a list of utterances.

Creates a new reader containing a single virtual file with the given utterances. Useful for splitting a reader into sub-readers based on utterance boundaries.

Parameters:

utterances (Sequence[Utterance]) – Utterance objects to include.

Returns:

CHAT

classmethod from_zip(path: str | PathLike[str], *, match: str | None = None, extension='.cha', parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')

Read CHAT data from a ZIP file.

Parameters:
  • path (str or os.PathLike[str]) – Path to the ZIP file.

  • match (str, optional) – Glob pattern to match filenames within the ZIP.

  • extension (str, optional) – File extension to match. Default is ".cha".

  • parallel (bool, optional) – If True, parse files in parallel.

  • strict (bool, optional) – If True, enforce strict parsing.

  • mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.

  • gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.

Returns:

CHAT

head(n=5)

Return the first n utterances with a formatted display.

headers()

Return the headers.

info(verbose=False)

Print summary information.

jyutping(*, by_utterance=False, by_file=False) -> list[str | None] | list[list[str | None]] | list[list[list[str | None]]]

Return the data in Jyutping romanization.

Parameters:
  • by_utterance (bool, optional) – If True, return Jyutping grouped by utterance.

  • by_file (bool, optional) – If True, return Jyutping grouped by file.

Returns:

list

languages(*, by_file=False)

Return the languages.

property n_files

The number of files.

participants(*, by_file=False)

Return the participants.

search(*, onset=None, nucleus=None, coda=None, tone=None, initial=None, final=None, jyutping=None, character=None, pos=None, word_range=(0, 0), utterance_range=(0, 0), by_token=True, by_utterance=False, by_file=False)

Search the data for the given criteria.

Parameters:
  • onset (str, optional) – Onset to search for. A regex is supported.

  • nucleus (str, optional) – Nucleus to search for. A regex is supported.

  • coda (str, optional) – Coda to search for. A regex is supported.

  • tone (str, optional) – Tone to search for. A regex is supported.

  • initial (str, optional) – Initial to search for. A regex is supported.

  • final (str, optional) – Final to search for.

  • jyutping (str, optional) – Jyutping romanization of one Cantonese character to search for.

  • character (str, optional) – One or more Cantonese characters to search for.

  • pos (str, optional) – A part-of-speech tag to search for. A regex is supported.

  • word_range (tuple[int, int], optional) – Span of words around a match. Default is (0, 0).

  • utterance_range (tuple[int, int], optional) – Span of utterances around a match. Default is (0, 0).

  • by_token (bool, optional) – If True, return Token objects. Otherwise return word strings.

  • by_utterance (bool, optional) – If True, return full utterances containing matches.

  • by_file (bool, optional) – If True, return data organized by file.

Returns:

list
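The idea of a span of words around a match can be illustrated with a small, library-independent sketch. `search_with_context` is a hypothetical helper, and reading `word_range` as `(words to the left, words to the right)` is an assumption; the parameter list above only says that it is a span around a match.

```python
def search_with_context(words, character, word_range=(0, 0)):
    """Return each matching word together with its neighbouring words."""
    left, right = word_range  # assumption: (left, right) counts of extra words
    hits = []
    for i, word in enumerate(words):
        if character in word:
            start = max(0, i - left)  # clamp at the start of the utterance
            hits.append(words[start : i + right + 1])
    return hits


utterance = ["我", "今日", "食", "咗", "飯"]
bare = search_with_context(utterance, "食")
with_context = search_with_context(utterance, "食", word_range=(1, 2))
```

With the default `(0, 0)`, only the matching word comes back; with `(1, 2)`, the sketch also returns one word before it and two words after it.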

tail(n=5)

Return the last n utterances with a formatted display.

to_files(dir_path: str | PathLike[str], *, filenames=None)

Write CHAT (.cha) files to a directory.

Parameters:
  • dir_path (str or os.PathLike[str]) – Output directory path.

  • filenames (list[str], optional) – Filenames for each file.

to_strs()

Return the data as CHAT-formatted strings.

Returns:

list[str]

tokens(*, by_utterance=False, by_file=False) -> list[Token] | list[list[Token]] | list[list[list[Token]]]

Return the tokens.

Parameters:
  • by_utterance (bool, optional) – If True, return tokens grouped by utterance.

  • by_file (bool, optional) – If True, return tokens grouped by file.

Returns:

list

utterances(*, by_file=False) -> list[Utterance] | list[list[Utterance]]

Return the utterances.

Parameters:

by_file (bool, optional) – If True, return utterances grouped by file.

Returns:

list[Utterance] | list[list[Utterance]]

word_ngrams(n: int)

Return word n-grams across all utterances.

N-grams do not cross utterance boundaries.

Parameters:

n (int) – The n-gram order (1 for unigrams, 2 for bigrams, etc.).

Returns:

Ngrams
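The statement that n-grams do not cross utterance boundaries can be made concrete with a short sketch. It is written against plain lists of words rather than a CHAT object; `ngrams_within_utterances` is a hypothetical helper, and collecting the results in a `collections.Counter` is a choice made for this illustration, not a description of the `Ngrams` class.

```python
from collections import Counter


def ngrams_within_utterances(utterances, n):
    """Count word n-grams, restarting the sliding window at every utterance."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    counts = Counter()
    for words in utterances:
        # The window never reaches past the end of the current utterance.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i : i + n])] += 1
    return counts


utterances = [["我", "食", "飯"], ["你", "食", "飯"]]
bigrams = ngrams_within_utterances(utterances, 2)
```

In this toy input, `("食", "飯")` is counted twice, while `("飯", "你")` never appears because it would straddle the two utterances.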

words(*, by_utterance=False, by_file=False) -> list[str] | list[list[str]] | list[list[list[str]]]

Return the words.

Parameters:
  • by_utterance (bool, optional) – If True, return words grouped by utterance.

  • by_file (bool, optional) – If True, return words grouped by file.

Returns:

list
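The three return shapes in the signature (a flat list, a list of lists, and a list of lists of lists) correspond to different grouping levels. The sketch below reproduces those shapes on plain nested lists; `group_words` is a hypothetical helper, and the way the two flags combine is inferred from the return annotation rather than confirmed by the text above.

```python
def group_words(files, *, by_utterance=False, by_file=False):
    """Flatten or keep nesting; `files` is a list of files, each a list of utterances."""
    if by_file and by_utterance:
        # Deepest shape: files -> utterances -> words.
        return [[list(utt) for utt in file] for file in files]
    if by_file:
        # One list of words per file.
        return [[word for utt in file for word in utt] for file in files]
    if by_utterance:
        # One list of words per utterance, file boundaries dropped.
        return [list(utt) for file in files for utt in file]
    # Default: a single flat list of words.
    return [word for file in files for utt in file for word in utt]


files = [
    [["我", "食", "飯"], ["好"]],  # first file, two utterances
    [["你", "呢"]],                # second file, one utterance
]
flat = group_words(files)
per_utterance = group_words(files, by_utterance=True)
per_file = group_words(files, by_file=True)
```

The same pattern applies to the other accessors that take `by_utterance` and `by_file`, with the innermost element type changing accordingly.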

Token

class pycantonese.corpus.Token(word, pos=None, jyutping=None, mor=None, gloss=None, gra=None)

A token with Cantonese-specific fields parsed from a CHAT utterance.

Utterance

class pycantonese.corpus.Utterance(*, participant, tokens, time_marks=None, tiers=None, audible=None, changeable_header=None, mor_tier_name=Ellipsis, gra_tier_name=Ellipsis)

An utterance from CHAT data with preprocessed Cantonese tokens.

Jyutping

class pycantonese.jyutping.Jyutping(onset: str, nucleus: str, coda: str, tone: str)

Jyutping representation of a Chinese/Cantonese character.

onset

Onset

Type:

str

nucleus

Nucleus

Type:

str

coda

Coda

Type:

str

tone

Tone

Type:

str

__eq__(other)

Return self==value.

__hash__()

Return hash(self).

__init__(onset: str, nucleus: str, coda: str, tone: str) -> None

__repr__()

Return repr(self).

__str__()

Combine onset + nucleus + coda + tone.

property final

Return the final (= nucleus + coda).
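The relationship between the four fields, the string form, and the final can be sketched with a small stand-in class. Everything here is hypothetical and independent of the library: `JyutpingSketch` only mirrors the documented attributes, and `parse_syllable` uses a simplified regular expression for a single syllable that does not enforce every phonotactic constraint of Jyutping.

```python
import re
from dataclasses import dataclass

# Simplified pattern for one syllable: optional onset, nucleus, optional coda, tone.
_SYLLABLE = re.compile(
    r"(?P<onset>ng|gw|kw|[bpmfdtnlgkhwzcsj])?"
    r"(?P<nucleus>aa|oe|eo|yu|[aeiou]|ng|m)"
    r"(?P<coda>ng|[ptkmniu])?"
    r"(?P<tone>[1-6])"
)


@dataclass(frozen=True)  # frozen dataclasses get __eq__ and __hash__ for free
class JyutpingSketch:
    """Hypothetical stand-in mirroring the documented fields."""

    onset: str
    nucleus: str
    coda: str
    tone: str

    def __str__(self) -> str:
        # Combine onset + nucleus + coda + tone.
        return self.onset + self.nucleus + self.coda + self.tone

    @property
    def final(self) -> str:
        # The final is the nucleus plus the coda.
        return self.nucleus + self.coda


def parse_syllable(jp: str) -> JyutpingSketch:
    """Split a single Jyutping syllable into its four parts."""
    m = _SYLLABLE.fullmatch(jp)
    if m is None:
        raise ValueError(f"not a recognizable Jyutping syllable: {jp!r}")
    return JyutpingSketch(m["onset"] or "", m["nucleus"], m["coda"] or "", m["tone"])


syllable = parse_syllable("gwong2")
syllabic_nasal = parse_syllable("ng5")
```

In the sketch, `gwong2` splits into onset `gw`, nucleus `o`, coda `ng`, and tone `2`, so its final is `ong`; the syllabic nasal `ng5` has an empty onset and coda.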

Headers

class rustling.chat.Headers

All file-level (non-changeable) headers from a CHAT file.

Ngrams

class rustling.ngram.Ngrams(n, *, min_n=None)

A Python-exposed wrapper for n-grams, visible to Python users as Ngrams. CHAT.word_ngrams() returns an instance of this class.