API Reference
Corpus Data
- Read Cantonese CHAT data files.
- Create a corpus object for the Hong Kong Cantonese Corpus.
- Create a corpus object for the CantoMap corpus.
- A reader for Cantonese CHAT corpus data.
- Search the data for the given criteria.
Jyutping Romanization
- Convert Cantonese characters into Jyutping romanization.
- Parse Jyutping romanization into onset, nucleus, coda, and tone.
- Convert Jyutping romanization into IPA.
- Convert Jyutping romanization into Yale romanization.
- Join Yale words (the output of …).
- Convert Yale romanization into Jyutping romanization.
- Convert Jyutping romanization into LaTeX TIPA.
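To make the onset/nucleus/coda/tone decomposition listed above concrete, here is a minimal standalone sketch using a regular expression. It is not pycantonese's parser: the function name `split_syllable` and the sound inventory in the pattern are assumptions made for illustration, and the sketch handles one syllable at a time.

```python
import re

# Assumed Jyutping inventory for illustration only; not taken from pycantonese.
_SYLLABLE = re.compile(
    r"(ng|gw|kw|[bpmfdtnlgkhwzcsj])?"  # onset (optional)
    r"(aa|oe|eo|yu|[aeiou]|ng|m)"      # nucleus, including syllabic m and ng
    r"(ng|[ptkmniu])?"                 # coda (optional)
    r"([1-6])"                         # tone
)


def split_syllable(syllable: str) -> tuple[str, str, str, str]:
    """Split one Jyutping syllable into (onset, nucleus, coda, tone).

    Hypothetical helper; missing parts come back as empty strings.
    """
    match = _SYLLABLE.fullmatch(syllable)
    if match is None:
        raise ValueError(f"not a valid Jyutping syllable: {syllable!r}")
    onset, nucleus, coda, tone = match.groups(default="")
    return onset, nucleus, coda, tone


print(split_syllable("gwong2"))
print(split_syllable("m4"))
```

Longer alternatives such as `ng`, `gw`, and `aa` are listed before the single-letter ones so that the regex prefers them, and backtracking lets syllabic `m4` and `ng5` parse with an empty onset.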
Grapheme-to-Phoneme Conversion
- Convert Cantonese characters into IPA (grapheme-to-phoneme).
Natural Language Processing
- Return Cantonese stop words.
- Parse raw Cantonese text.
- Segment the unsegmented input.
- Tag the words for their parts of speech.
- Map a part-of-speech tag from HKCanCor to Universal Dependencies.
CHAT
- class pycantonese.CHAT(chat: Chat | None = None)[source]
A reader for Cantonese CHAT corpus data.
This class wraps a Rust-backed CHAT parser and provides Cantonese-specific functionality such as Jyutping extraction, character-level access, and corpus search.
- characters(*, by_utterance=False, by_file=False) list[str] | list[list[str]] | list[list[list[str]]][source]
Return the data in individual Chinese characters.
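The three return shapes in the signature follow from the `by_utterance` and `by_file` flags. The sketch below shows one way such nesting could be produced from per-file, per-utterance data. It is a standalone illustration, not the library's code: the helper name `shape` is hypothetical, and the behavior when only `by_file` is set (one flat list per file) is an assumption.

```python
from itertools import chain


def shape(files, *, by_utterance=False, by_file=False):
    """Arrange items given as files -> utterances -> items.

    Hypothetical helper illustrating the flat, two-level,
    and three-level return shapes.
    """
    if by_file and by_utterance:
        # Three levels: file -> utterance -> item.
        return [[list(utt) for utt in f] for f in files]
    if by_file:
        # Assumed: two levels, one flat list of items per file.
        return [list(chain.from_iterable(f)) for f in files]
    if by_utterance:
        # Two levels: one list of items per utterance, files merged.
        return [list(utt) for f in files for utt in f]
    # Flat list of all items.
    return [item for f in files for utt in f for item in utt]


data = [[["你", "好"], ["食", "飯"]], [["早", "晨"]]]
print(shape(data))
print(shape(data, by_utterance=True))
```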
- property file_paths
The file paths.
- classmethod from_dir(path: str | PathLike[str], *, match: str | None = None, extension='.cha', parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')[source]
Read CHAT data from a directory.
- Parameters:
path (str or os.PathLike[str]) – Path to the directory.
match (str, optional) – Glob pattern to match filenames within the directory.
extension (str, optional) – File extension to match. Default is ".cha".
parallel (bool, optional) – If True, parse files in parallel.
strict (bool, optional) – If True, enforce strict parsing.
mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.
gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.
- Returns:
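The `match` and `extension` parameters of `from_dir` both narrow down which files are read. As a rough illustration of how a glob pattern and an extension filter can combine, here is a standalone sketch built on the standard library. The helper `select_files` is hypothetical, and applying `match` to the bare file name (rather than the full relative path) is an assumption; the library may behave differently.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def select_files(names, *, match=None, extension=".cha"):
    """Keep names with the given extension that also satisfy the glob.

    Hypothetical helper; `match` is tested against the file name only.
    """
    selected = []
    for name in names:
        path = PurePosixPath(name)
        if path.suffix != extension:
            continue  # wrong extension
        if match is not None and not fnmatchcase(path.name, match):
            continue  # glob pattern does not match
        selected.append(name)
    return sorted(selected)


names = ["b/002.cha", "a/001.cha", "a/notes.txt", "a/readme.cha"]
print(select_files(names, match="0*"))
```

With `match=None`, only the extension filter applies, which mirrors the defaults shown in the signature.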
- classmethod from_files(paths: Sequence[str | PathLike[str]], *, parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')[source]
Read CHAT data from file paths.
- Parameters:
paths (Sequence[str | os.PathLike[str]]) – Paths to CHAT files.
parallel (bool, optional) – If True, parse files in parallel.
strict (bool, optional) – If True, enforce strict parsing.
mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.
gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.
- Returns:
- classmethod from_git(url: str, *, rev: str | None = None, depth: int | None = None, match: str | None = None, extension='.cha', cache_dir: str | PathLike[str] | None = None, force_download=False, parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')[source]
Read CHAT data from a Git repository.
- Parameters:
url (str) – URL of the Git repository.
rev (str, optional) – Git revision (branch, tag, or commit hash).
depth (int, optional) – Clone depth for shallow clones.
match (str, optional) – Glob pattern to match filenames within the repository.
extension (str, optional) – File extension to match. Default is ".cha".
cache_dir (str or os.PathLike[str], optional) – Directory to cache the cloned repository.
force_download (bool, optional) – If True, force re-download even if cached.
parallel (bool, optional) – If True, parse files in parallel.
strict (bool, optional) – If True, enforce strict parsing.
mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.
gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.
- Returns:
- classmethod from_strs(strs, *, ids=None, parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')[source]
Read CHAT data from strings.
- Parameters:
parallel (bool, optional) – If True, parse strings in parallel.
strict (bool, optional) – If True, enforce strict parsing.
mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.
gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.
- Returns:
- classmethod from_url(url: str, *, match: str | None = None, extension='.cha', cache_dir: str | PathLike[str] | None = None, force_download=False, parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')[source]
Read CHAT data from a URL pointing to a ZIP archive.
- Parameters:
url (str) – URL of the ZIP archive.
match (str, optional) – Glob pattern to match filenames within the archive.
extension (str, optional) – File extension to match. Default is ".cha".
cache_dir (str or os.PathLike[str], optional) – Directory to cache the downloaded archive.
force_download (bool, optional) – If True, force re-download even if cached.
parallel (bool, optional) – If True, parse files in parallel.
strict (bool, optional) – If True, enforce strict parsing.
mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.
gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.
- Returns:
- classmethod from_utterances(utterances)[source]
Construct a CHAT reader from a list of utterances.
Creates a new reader containing a single virtual file with the given utterances. Useful for splitting a reader into sub-readers based on utterance boundaries.
- classmethod from_zip(path: str | PathLike[str], *, match: str | None = None, extension='.cha', parallel=True, strict=True, mor_tier='%mor', gra_tier='%gra')[source]
Read CHAT data from a ZIP file.
- Parameters:
path (str or os.PathLike[str]) – Path to the ZIP file.
match (str, optional) – Glob pattern to match filenames within the ZIP.
extension (str, optional) – File extension to match. Default is ".cha".
parallel (bool, optional) – If True, parse files in parallel.
strict (bool, optional) – If True, enforce strict parsing.
mor_tier (str or None, optional) – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Default is "%mor". Set to None to disable mor+gra handling.
gra_tier (str or None, optional) – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Default is "%gra". Set to None to disable mor+gra handling.
- Returns:
- jyutping(*, by_utterance=False, by_file=False) list[str | None] | list[list[str | None]] | list[list[list[str | None]]][source]
Return the data in Jyutping romanization.
- property n_files
The number of files.
- search(*, onset=None, nucleus=None, coda=None, tone=None, initial=None, final=None, jyutping=None, character=None, pos=None, word_range=(0, 0), utterance_range=(0, 0), by_token=True, by_utterance=False, by_file=False)[source]
Search the data for the given criteria.
- Parameters:
onset (str, optional) – Onset to search for. A regex is supported.
nucleus (str, optional) – Nucleus to search for. A regex is supported.
coda (str, optional) – Coda to search for. A regex is supported.
tone (str, optional) – Tone to search for. A regex is supported.
initial (str, optional) – Initial to search for. A regex is supported.
final (str, optional) – Final to search for.
jyutping (str, optional) – Jyutping romanization of one Cantonese character to search for.
character (str, optional) – One or more Cantonese characters to search for.
pos (str, optional) – A part-of-speech tag to search for. A regex is supported.
word_range (tuple[int, int], optional) – Span of words around a match. Default is (0, 0).
utterance_range (tuple[int, int], optional) – Span of utterances around a match. Default is (0, 0).
by_token (bool, optional) – If True, return Token objects. Otherwise return word strings.
by_utterance (bool, optional) – If True, return full utterances containing matches.
by_file (bool, optional) – If True, return data organized by file.
- Returns:
list
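To illustrate how a regex criterion and `word_range` can interact, the standalone sketch below searches one utterance by tone and returns each match with its surrounding words. It does not call pycantonese. The helper `search_by_tone` is hypothetical, and two points are assumptions: that `word_range` means (words to the left, words to the right), and that each word carries a single-syllable Jyutping string whose last character is the tone.

```python
import re


def search_by_tone(tokens, tone, *, word_range=(0, 0)):
    """Return a window of tokens around each word whose tone matches.

    Hypothetical helper. `tokens` is a list of (word, jyutping) pairs
    for one utterance; `tone` is a regex applied to the tone digit.
    """
    left, right = word_range  # assumed meaning: (left span, right span)
    pattern = re.compile(tone)
    hits = []
    for i, (_, jyutping) in enumerate(tokens):
        # Assumed: the tone is the final character of the Jyutping string.
        if jyutping and pattern.fullmatch(jyutping[-1]):
            start = max(0, i - left)  # clamp at the utterance start
            hits.append(tokens[start : i + 1 + right])
    return hits


utterance = [("我", "ngo5"), ("食", "sik6"), ("飯", "faan6")]
print(search_by_tone(utterance, "6", word_range=(1, 0)))
```

This sketch always returns a list of windows, even for the default `(0, 0)`; the shape of the library's own result may differ.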
- to_files(dir_path: str | PathLike[str], *, filenames=None)[source]
Write CHAT (.cha) files to a directory.
- Parameters:
dir_path (str or os.PathLike[str]) – Output directory path.
- tokens(*, by_utterance=False, by_file=False) list[Token] | list[list[Token]] | list[list[list[Token]]][source]
Return the tokens.
- utterances(*, by_file=False) list[Utterance] | list[list[Utterance]][source]
Return the utterances.
- Parameters:
by_file (bool, optional) – If True, return utterances grouped by file.
- Returns:
list[Utterance] | list[list[Utterance]]
Token
- class pycantonese.corpus.Token(word, pos=None, jyutping=None, mor=None, gloss=None, gra=None)
A token with Cantonese-specific fields parsed from a CHAT utterance.
Utterance
- class pycantonese.corpus.Utterance(*, participant, tokens, time_marks=None, tiers=None, audible=None, changeable_header=None, mor_tier_name=Ellipsis, gra_tier_name=Ellipsis)
An utterance from CHAT data with preprocessed Cantonese tokens.
Jyutping
Headers
- class rustling.chat.Headers
All file-level (non-changeable) headers from a CHAT file.
Ngrams
- class rustling.ngram.Ngrams(n, *, min_n=None)
Python-exposed wrapper. Python users see this as Ngrams.