API Reference
Corpus Data
Read Cantonese CHAT data files.
Create a corpus object for the Hong Kong Cantonese Corpus.
A reader for Cantonese CHAT corpus data.
Search the data for the given criteria.
Jyutping Romanization
Convert Cantonese characters into Jyutping romanization.
Parse Jyutping romanization into onset, nucleus, coda, and tone.
Convert Jyutping romanization into IPA.
Convert Jyutping romanization into Yale romanization.
Convert Jyutping romanization into LaTeX TIPA.
Natural Language Processing
Return Cantonese stop words.
Parse raw Cantonese text.
Segment the unsegmented input.
Tag the words for their parts of speech.
Map a part-of-speech tag from HKCanCor to Universal Dependencies.
CHAT
- class pycantonese.CHAT(chat: Chat | None = None)
A reader for Cantonese CHAT corpus data.
This class wraps a Rust-backed CHAT parser and provides Cantonese-specific functionality such as Jyutping extraction, character-level access, and corpus search.
- Attributes:
file_paths: The file paths.
n_files: The number of files.
Methods
ages(): Return the ages.
append(other): Append another CHAT object's data.
characters(*[, by_utterance, by_file]): Return the data in individual Chinese characters.
extend(others): Extend with data from multiple CHAT objects.
filter(*[, participants, files]): Filter the data by participants and/or files.
from_dir(path, *[, match, extension, ...]): Read CHAT data from a directory.
from_files(paths, *[, parallel, strict]): Read CHAT data from file paths.
from_strs(strs, *[, ids, parallel, strict]): Read CHAT data from strings.
from_utterances(utterances): Construct a CHAT reader from a list of utterances.
from_zip(path, *[, match, extension, ...]): Read CHAT data from a ZIP file.
head([n]): Return the first n utterances with a formatted display.
headers(): Return the headers.
info([verbose]): Print summary information.
jyutping(*[, by_utterance, by_file]): Return the data in Jyutping romanization.
languages(*[, by_file]): Return the languages.
participants(*[, by_file]): Return the participants.
search(*[, onset, nucleus, coda, tone, ...]): Search the data for the given criteria.
tail([n]): Return the last n utterances with a formatted display.
to_chat(path, *[, is_dir, filenames]): Write the data to CHAT file(s).
to_strs(): Return the data as CHAT-formatted strings.
tokens(*[, by_utterance, by_file]): Return the tokens.
utterances(*[, by_file]): Return the utterances.
word_ngrams(n): Return word n-grams across all utterances.
words(*[, by_utterance, by_file]): Return the words.
- characters(*, by_utterance=False, by_file=False) -> list[str] | list[list[str]] | list[list[list[str]]]
Return the data in individual Chinese characters.
- Parameters:
- by_utterance : bool, optional
If True, return characters grouped by utterance.
- by_file : bool, optional
If True, return characters grouped by file.
- Returns:
- list
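The three return types in the signature suggest how the two flags combine. The following is a minimal sketch of that shaping logic on a toy nested list, not the library's implementation; the `shape` helper and the `corpus` data are hypothetical, and the assumption is that data is organized as files containing utterances containing items.

```python
from itertools import chain

# Hypothetical toy corpus: files -> utterances -> characters.
corpus = [
    [["你", "好"], ["早", "晨"]],  # file 1, two utterances
    [["多", "謝"]],                # file 2, one utterance
]


def shape(files, *, by_utterance=False, by_file=False):
    """Flatten nested data according to the two grouping flags (assumed semantics)."""
    if by_file and by_utterance:
        # Keep both levels: list[list[list[str]]]
        return [[list(utt) for utt in f] for f in files]
    if by_file:
        # One flat list per file: list[list[str]]
        return [list(chain.from_iterable(f)) for f in files]
    if by_utterance:
        # One list per utterance, file boundaries dropped: list[list[str]]
        return [list(utt) for f in files for utt in f]
    # Fully flat: list[str]
    return [item for f in files for utt in f for item in utt]


flat = shape(corpus)
per_utterance = shape(corpus, by_utterance=True)
per_file = shape(corpus, by_file=True)
```

The same pattern would apply to `jyutping()`, `tokens()`, and `words()`, which take the same two flags.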
- property file_paths
The file paths.
- filter(*, participants=None, files=None)
Filter the data by participants and/or files.
- Parameters:
- participants : str, optional
Regex pattern to match participant codes.
- files : str, optional
Glob pattern to match file paths.
- Returns:
- CHAT
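To make the difference between the two pattern types concrete, here is a rough stdlib sketch of regex-based participant filtering combined with glob-based file filtering. It is not how `filter` is implemented; the `filter_records` helper and the sample records are hypothetical, and full-match regex semantics are an assumption, since the reference does not say whether a partial match is enough.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical records: (file path, participant code, utterance text).
records = [
    ("data/child01.cha", "CHI", "食飯"),
    ("data/child01.cha", "MOT", "食咗未"),
    ("data/adult02.cha", "INV", "係呀"),
]


def filter_records(records, *, participants=None, files=None):
    """Keep records whose participant code matches a regex and whose path matches a glob."""
    kept = []
    for path, code, text in records:
        # Assumption: the regex must match the whole participant code.
        if participants is not None and not re.fullmatch(participants, code):
            continue
        # Glob matching on the full path string, case-sensitive.
        if files is not None and not fnmatchcase(path, files):
            continue
        kept.append((path, code, text))
    return kept


child_and_mother = filter_records(records, participants="CHI|MOT")
child_files_only = filter_records(records, files="*child*")
```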
- classmethod from_dir(path: str | PathLike[str], *, match: str | None = None, extension='.cha', parallel=True, strict=True)
Read CHAT data from a directory.
- Parameters:
- path : str or os.PathLike[str]
Path to the directory.
- match : str, optional
Glob pattern to match filenames within the directory.
- extension : str, optional
File extension to match. Default is ".cha".
- parallel : bool, optional
If True, parse files in parallel.
- strict : bool, optional
If True, enforce strict parsing.
- Returns:
- classmethod from_files(paths: Sequence[str | PathLike[str]], *, parallel=True, strict=True)
Read CHAT data from file paths.
- Parameters:
- paths : Sequence[str | os.PathLike[str]]
Paths to CHAT files.
- parallel : bool, optional
If True, parse files in parallel.
- strict : bool, optional
If True, enforce strict parsing.
- Returns:
- classmethod from_strs(strs, *, ids=None, parallel=True, strict=True)
Read CHAT data from strings.
- Parameters:
- strs : list[str]
CHAT-formatted strings.
- ids : list[str], optional
Identifiers for each string.
- parallel : bool, optional
If True, parse strings in parallel.
- strict : bool, optional
If True, enforce strict parsing.
- Returns:
- classmethod from_utterances(utterances)
Construct a CHAT reader from a list of utterances.
Creates a new reader containing a single virtual file with the given utterances. Useful for splitting a reader into sub-readers based on utterance boundaries.
- Parameters:
- utterances : Sequence[Utterance]
Utterance objects to include.
- Returns:
- classmethod from_zip(path: str | PathLike[str], *, match: str | None = None, extension='.cha', parallel=True, strict=True)
Read CHAT data from a ZIP file.
- Parameters:
- path : str or os.PathLike[str]
Path to the ZIP file.
- match : str, optional
Glob pattern to match filenames within the ZIP.
- extension : str, optional
File extension to match. Default is ".cha".
- parallel : bool, optional
If True, parse files in parallel.
- strict : bool, optional
If True, enforce strict parsing.
- Returns:
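As an illustration of how `match` and `extension` could jointly select archive members, the sketch below lists and reads matching entries from a ZIP file using only the standard library. It does not parse CHAT and is not the library's code; the `read_members` helper is hypothetical, and it assumes that `extension` is a suffix test and `match` is a glob applied to the member name.

```python
import io
import zipfile
from fnmatch import fnmatchcase


def read_members(zip_source, *, match=None, extension=".cha"):
    """Return {member name: decoded text} for members selected by suffix and glob."""
    texts = {}
    with zipfile.ZipFile(zip_source) as zf:
        for name in sorted(zf.namelist()):
            # Assumption: the extension filter is a plain suffix check.
            if not name.endswith(extension):
                continue
            # Assumption: the glob is matched against the full member name.
            if match is not None and not fnmatchcase(name, match):
                continue
            texts[name] = zf.read(name).decode("utf-8")
    return texts


# Build a small in-memory archive so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("corpus/a.cha", "@UTF8\n@Begin\n@End\n")
    zf.writestr("corpus/b.cha", "@UTF8\n@Begin\n@End\n")
    zf.writestr("corpus/readme.txt", "not CHAT data")

all_chat = read_members(buffer)
only_a = read_members(buffer, match="*a.cha")
```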
- jyutping(*, by_utterance=False, by_file=False) -> list[str | None] | list[list[str | None]] | list[list[list[str | None]]]
Return the data in Jyutping romanization.
- Parameters:
- by_utterance : bool, optional
If True, return Jyutping grouped by utterance.
- by_file : bool, optional
If True, return Jyutping grouped by file.
- Returns:
- list
- property n_files
The number of files.
- search(*, onset=None, nucleus=None, coda=None, tone=None, initial=None, final=None, jyutping=None, character=None, pos=None, word_range=(0, 0), utterance_range=(0, 0), by_token=True, by_utterance=False, by_file=False)
Search the data for the given criteria.
- Parameters:
- onset : str, optional
Onset to search for. A regex is supported.
- nucleus : str, optional
Nucleus to search for. A regex is supported.
- coda : str, optional
Coda to search for. A regex is supported.
- tone : str, optional
Tone to search for. A regex is supported.
- initial : str, optional
Initial to search for. A regex is supported.
- final : str, optional
Final to search for.
- jyutping : str, optional
Jyutping romanization of one Cantonese character to search for.
- character : str, optional
One or more Cantonese characters to search for.
- pos : str, optional
A part-of-speech tag to search for. A regex is supported.
- word_range : tuple[int, int], optional
Span of words around a match. Default is (0, 0).
- utterance_range : tuple[int, int], optional
Span of utterances around a match. Default is (0, 0).
- by_token : bool, optional
If True, return Token objects. Otherwise return word strings.
- by_utterance : bool, optional
If True, return full utterances containing matches.
- by_file : bool, optional
If True, return data organized by file.
- Returns:
- list
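The `word_range` parameter can be pictured as a window around each matching word. The sketch below assumes the tuple means (words to the left, words to the right) and that the window is clipped at utterance boundaries; both are assumptions, and the `context_window` helper is hypothetical rather than part of the library.

```python
def context_window(words, index, word_range=(0, 0)):
    """Return the matching word plus its neighbours within the given span."""
    left, right = word_range
    # Clip at the start so a large left span never wraps around.
    start = max(0, index - left)
    # Slicing past the end of a list is safe in Python, so no clipping is needed.
    return words[start : index + right + 1]


utterance = ["我", "想", "食", "飯", "呀"]
match_index = utterance.index("食")

match_only = context_window(utterance, match_index)
one_each_side = context_window(utterance, match_index, (1, 1))
clipped_left = context_window(utterance, match_index, (5, 1))
```

Under the same assumptions, `utterance_range` would apply the identical windowing one level up, over a list of utterances rather than a list of words.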
- to_chat(path: str | PathLike[str], *, is_dir=False, filenames=None)
Write the data to CHAT file(s).
- Parameters:
- path : str or os.PathLike[str]
Output path.
- is_dir : bool, optional
If True, write each file to a directory.
- filenames : list[str], optional
Filenames for each file.
- tokens(*, by_utterance=False, by_file=False) -> list[Token] | list[list[Token]] | list[list[list[Token]]]
Return the tokens.
- Parameters:
- by_utterance : bool, optional
If True, return tokens grouped by utterance.
- by_file : bool, optional
If True, return tokens grouped by file.
- Returns:
- list
- utterances(*, by_file=False) -> list[Utterance] | list[list[Utterance]]
Return the utterances.
- Parameters:
- by_file : bool, optional
If True, return utterances grouped by file.
- Returns:
- list[Utterance] | list[list[Utterance]]
Token
- class pycantonese.corpus.Token(word, pos=None, jyutping=None, mor=None, gloss=None, gra=None)
A token with Cantonese-specific fields parsed from a CHAT utterance.
- Attributes:
- gloss
- gra
- jyutping
- mor
- pos
- word
Methods
to_gra_tier
to_mor_tier
Jyutping
- class pycantonese.jyutping.Jyutping(onset: str, nucleus: str, coda: str, tone: str)
Jyutping representation of a Chinese/Cantonese character.
- Attributes:
- onset : str
Onset
- nucleus : str
Nucleus
- coda : str
Coda
- tone : str
Tone
- __eq__(other)
Return self==value.
- __repr__()
Return repr(self).
- property final
Return the final (= nucleus + coda).
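The four fields and the derived `final` can be illustrated with a small stand-in class and a simplified regex parser. This is a sketch only: `JyutpingSketch` and `parse_syllable` are hypothetical names, the regex is an approximation that accepts some strings that are not valid Jyutping, and none of it reflects how the library itself parses syllables.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class JyutpingSketch:
    """Stand-in for a parsed syllable with the same four string fields."""

    onset: str
    nucleus: str
    coda: str
    tone: str

    @property
    def final(self) -> str:
        # The final is the nucleus followed by the coda.
        return self.nucleus + self.coda


# Simplified pattern: optional onset, nucleus (including syllabic m/ng),
# optional coda, and a tone digit from 1 to 6.
_SYLLABLE = re.compile(
    r"(ng|gw|kw|[bpmfdtnlgkhwzcsj])?"
    r"(aa|oe|eo|yu|[aeiou]|ng|m)"
    r"(ng|[ptkmniu])?"
    r"([1-6])"
)


def parse_syllable(text: str) -> JyutpingSketch:
    """Split one Jyutping syllable into onset, nucleus, coda, and tone."""
    matched = _SYLLABLE.fullmatch(text)
    if matched is None:
        raise ValueError(f"not a recognizable Jyutping syllable: {text!r}")
    onset, nucleus, coda, tone = matched.groups(default="")
    return JyutpingSketch(onset, nucleus, coda, tone)


gwong = parse_syllable("gwong2")
syllabic = parse_syllable("m4")
```

Here `gwong.final` is `"ong"`, and the syllabic nasal in `"m4"` is treated as a nucleus with an empty onset and coda, which is a choice made for this sketch.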
Headers
- class rustling.chat.Headers
All file-level (non-changeable) headers from a CHAT file.
- Attributes:
- comments
- date
- languages
- location
- media
- number
- options
- other
- participants
- pid
- recording_quality
- room_layout
- situation
- tape_location
- time_duration
- time_start
- transcriber
- transcription
- types
- videos
- warning
Ngrams
- class rustling.ngram.Ngrams(n, *, min_n=None)
A counter for n-grams. This is the Python-exposed wrapper class, which Python users see as Ngrams.
Methods
clear(): Clear all counts.
count(seq): Count n-grams from a single sequence.
count_seqs(seqs): Count n-grams from multiple sequences.
get(ngram): Return the count for a specific n-gram.
items(*[, order]): Return all (n-gram, count) pairs.
most_common([n, order]): Return the n most common n-grams with their counts.
to_counter(*[, order]): Convert to a Python collections.Counter.
total(*[, order]): Return the total number of n-gram tokens counted.
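For readers unfamiliar with n-gram counting, the idea behind `count` and `to_counter` can be approximated with `collections.Counter`. The sketch below is not the Rust-backed implementation; the `count_ngrams` helper is hypothetical, and it assumes that `min_n` means lower orders from `min_n` up to `n` are counted as well.

```python
from collections import Counter


def count_ngrams(seq, n, min_n=None):
    """Count n-grams of a sequence as tuples, optionally including lower orders."""
    # Assumption: with min_n unset, only order n is counted.
    lowest = n if min_n is None else min_n
    counts = Counter()
    for order in range(lowest, n + 1):
        # A sequence shorter than the order yields an empty range.
        for i in range(len(seq) - order + 1):
            counts[tuple(seq[i : i + order])] += 1
    return counts


words = ["食", "飯", "食", "飯"]
bigrams = count_ngrams(words, 2)
uni_and_bigrams = count_ngrams(words, 2, min_n=1)
top_bigram = bigrams.most_common(1)
```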