Corpus Reader Methods

After you have created a corpus reader (see Corpus Data), the headers, transcriptions, and annotations are all accessible through the methods of the corpus reader object.

Let’s say we have a corpus reader with the built-in HKCanCor data.

import pycantonese
corpus = pycantonese.hkcancor()

Headers

A CHAT data file typically has metadata at the top of the file, in lines that begin with @. The metadata include the participants’ demographics (age, gender, etc.) and the languages used in the data.

For HKCanCor specifically, the participants in all the data files are anonymous. In PyCantonese’s rendition of HKCanCor, their names are simply placeholders such as A and B, and their corresponding three-letter codes are XXA, XXB, etc. In contrast, many CHILDES and TalkBank datasets identify their participants. By convention, the target child’s code is CHI, the mother’s is MOT, and the father’s is FAT.

Since PyCantonese uses Rustling to parse CHAT data files, header information is accessed in the same way in both packages. Please see Rustling’s documentation on headers.

To see how a header from HKCanCor translates to its representation in PyCantonese, here is the header from FC-001_v2.cha, the first (by filename) of the 58 CHAT files:

@UTF8
@Begin
@Languages: yue , eng
@Participants:      XXA A Adult , XXB B Adult
@ID:        yue , eng|HKCanCor|XXA|34;|female|||Adult||origin:HK|
@ID:        yue , eng|HKCanCor|XXB|37;|female|||Adult||origin:HK|
@Date:      30-APR-1997
@Tape Number:       001

In this example, the recording session was between two female speakers from Hong Kong (ages 34 and 37) and took place on April 30, 1997. The languages in this data file are Cantonese and English, in that order of usage frequency; the ordering in yue , eng is meaningful.

Through the corpus reader object corpus we’ve just created, we can see the same information by calling the headers() method, which returns a list of Headers objects. Indexing with [0] gets the first one, which corresponds to FC-001_v2.cha:

corpus.headers()[0]
# Headers(languages=["yue", "eng"], participants=[...2], date=Some("30-APR-1997"))
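To make the correspondence between the raw header and its parsed fields more concrete, here is a minimal standalone sketch that splits one @ID line on the | delimiter. It does not use PyCantonese or Rustling, and it is not how either package parses headers. The function name parse_id_line and the labels in ID_FIELDS are hypothetical, chosen to follow the usual order of @ID fields in the CHAT format.

```python
# Hypothetical field labels, following the usual order of @ID fields in CHAT.
ID_FIELDS = (
    "language", "corpus", "code", "age", "sex",
    "group", "ses", "role", "education", "custom",
)


def parse_id_line(line):
    """Split a raw '@ID:' header line into a dict of labeled fields."""
    # Drop the '@ID:' prefix; only the first colon separates it from the value.
    _, _, value = line.partition(":")
    parts = [part.strip() for part in value.strip().split("|")]
    # zip() stops at the shorter input, so the empty string after the
    # trailing '|' is ignored.
    return dict(zip(ID_FIELDS, parts))


raw = "@ID:        yue , eng|HKCanCor|XXA|34;|female|||Adult||origin:HK|"
info = parse_id_line(raw)
print(info["code"], info["age"], info["sex"], info["role"])
# XXA 34; female Adult
```

Empty fields (here group, ses, and education) come back as empty strings, which mirrors the empty slots between consecutive | characters in the raw line.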

Here are the currently implemented methods for header information:

ages()

Return the ages.

headers()

Return the headers.

languages(*[, by_file])

Return the languages.

participants(*[, by_file])

Return the participants.

Transcriptions and Annotations

A PyCantonese corpus reader is an instance of the CHAT class. This class inherits its CHAT handling capabilities from the underlying Rustling package and adds several features for Cantonese-specific elements, particularly Jyutping romanization and Chinese characters.

CHAT has convenience methods to give you an overview of the data in the reader.

info([verbose])

Print summary information.

head([n])

Return the first n utterances with a formatted display.

tail([n])

Return the last n utterances with a formatted display.

corpus.info()
# 58 file(s), 16162 utterance(s)
corpus.head()
# *XXA:  喂       遲      啲      去        唔     去        旅行             啊      ?
# %mor:  e|wai3  a|ci4  u|di1  v|heoi3  d|m4  v|heoi3  vn|leoi5hang4  y|aa3  ?
#
# *XXA:  你       老公           有冇           平        機票          啊      ?
# %mor:  r|nei5  n|lou5gung1  v1|jau5mou5  a|peng4  n|gei1piu3  y|aa3  ?
#
# *XXB:  平        機票          要        淡季             先       有得           平        𡃉       喎      .
# %mor:  a|peng4  n|gei1piu3  vu|jiu3  an|daam6gwai3  d|sin1  vu|jau5dak1  a|peng4  y|gaa3  y|wo3  .
#
# *XXB:  而家         旺        -  .
# %mor:  t|ji4gaa1  a|wong6  -  .
#
# *XXA:  冇得           去        嗱       .
# %mor:  vu|mou5dak1  v|heoi3  y|laa4  .
#

Here are the major CHAT methods to access data at different levels of data structure:

words(*[, by_utterance, by_file])

Return the words.

tokens(*[, by_utterance, by_file])

Return the tokens.

utterances(*[, by_file])

Return the utterances.

Words are plain text strings. Think of tokens as words with annotations (part-of-speech tags, morphological information, etc.). An utterance is a list of tokens plus associated information (the participant who produced the utterance, time markers if there are associated audio-visual materials, etc.).

corpus.words()[:10]
# ['喂', '遲', '啲', '去', '唔', '去', '旅行', '啊', '?', '你']

corpus.tokens()[:10]
# [Token(word='喂', pos='e', jyutping='wai3', mor=None, gloss=None, gra=None),
#  Token(word='遲', pos='a', jyutping='ci4', mor=None, gloss=None, gra=None),
#  Token(word='啲', pos='u', jyutping='di1', mor=None, gloss=None, gra=None),
#  Token(word='去', pos='v', jyutping='heoi3', mor=None, gloss=None, gra=None),
#  Token(word='唔', pos='d', jyutping='m4', mor=None, gloss=None, gra=None),
#  Token(word='去', pos='v', jyutping='heoi3', mor=None, gloss=None, gra=None),
#  Token(word='旅行', pos='vn', jyutping='leoi5hang4', mor=None, gloss=None, gra=None),
#  Token(word='啊', pos='y', jyutping='aa3', mor=None, gloss=None, gra=None),
#  Token(word='?', pos='', jyutping=None, mor=None, gloss=None, gra=None),
#  Token(word='你', pos='r', jyutping='nei5', mor=None, gloss=None, gra=None)]

corpus.utterances()[:1]
# [Utterance(participant='XXA', tokens=[...9 tokens], time_marks=None)]
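The relationship between the flat and the per-utterance views of the data can be sketched with plain Python lists. This is only an illustration, under the assumption that by_utterance=True groups the items of each utterance into its own inner list. The Token named tuple below is a simplified, hypothetical stand-in rather than the PyCantonese class, and the sample data are two utterances copied from the head() output, with the final period given an empty pos and no Jyutping by analogy with the question mark.

```python
from itertools import chain
from typing import NamedTuple, Optional


class Token(NamedTuple):
    """Simplified, hypothetical stand-in for an annotated word."""
    word: str
    pos: str
    jyutping: Optional[str]


# One inner list per utterance (the assumed shape of by_utterance=True).
tokens_by_utterance = [
    [
        Token("你", "r", "nei5"),
        Token("老公", "n", "lou5gung1"),
        Token("有冇", "v1", "jau5mou5"),
        Token("平", "a", "peng4"),
        Token("機票", "n", "gei1piu3"),
        Token("啊", "y", "aa3"),
        Token("?", "", None),
    ],
    [
        Token("冇得", "vu", "mou5dak1"),
        Token("去", "v", "heoi3"),
        Token("嗱", "y", "laa4"),
        Token(".", "", None),
    ],
]

# Flattening the nested list gives the flat view of the tokens.
flat_tokens = list(chain.from_iterable(tokens_by_utterance))

# Words are the tokens with the annotations stripped away.
words = [token.word for token in flat_tokens]
words_by_utterance = [
    [token.word for token in utterance] for utterance in tokens_by_utterance
]

print(len(flat_tokens))
# 11
print(words_by_utterance[1])
# ['冇得', '去', '嗱', '.']
```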

PyCantonese has an augmented representation of tokens, where Jyutping romanization and glosses have their own dedicated attributes.

Jyutping Romanization

Tokens, as annotated words, are instances of the Token class. A Token instance has the PyCantonese-specific attribute jyutping to accommodate Jyutping romanization.

To illustrate, below is the first utterance in FC-001_v2.cha, where Jyutping romanization is found in the %mor tier:

*XXA:       喂 遲 啲 去 唔 去 旅行 啊 ?
%mor:       e|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?

Here are the corresponding tokens from PyCantonese, where the data in CHAT format has been parsed into Token objects, with the attribute jyutping storing Jyutping romanization:

some_tokens = corpus.tokens(by_utterance=True)[0]
some_tokens
# [Token(word='喂', pos='e', jyutping='wai3', mor=None, gloss=None, gra=None),
#  Token(word='遲', pos='a', jyutping='ci4', mor=None, gloss=None, gra=None),
#  Token(word='啲', pos='u', jyutping='di1', mor=None, gloss=None, gra=None),
#  Token(word='去', pos='v', jyutping='heoi3', mor=None, gloss=None, gra=None),
#  Token(word='唔', pos='d', jyutping='m4', mor=None, gloss=None, gra=None),
#  Token(word='去', pos='v', jyutping='heoi3', mor=None, gloss=None, gra=None),
#  Token(word='旅行', pos='vn', jyutping='leoi5hang4', mor=None, gloss=None, gra=None),
#  Token(word='啊', pos='y', jyutping='aa3', mor=None, gloss=None, gra=None),
#  Token(word='?', pos='', jyutping=None, mor=None, gloss=None, gra=None)]
for token in some_tokens:
    print(token.jyutping)

# wai3
# ci4
# di1
# heoi3
# m4
# heoi3
# leoi5hang4
# aa3
# None

Given the ubiquitous status of Jyutping in the study of Cantonese, the jyutping() method is also defined for convenience:

corpus.jyutping(by_utterance=True)[0]
# ['wai3', 'ci4', 'di1', 'heoi3', 'm4', 'heoi3', 'leoi5hang4', 'aa3', None]
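For intuition about where these values come from, the following standalone sketch pairs each word on the main tier with its pos|jyutping item on the %mor tier. It assumes whitespace-separated items, exactly one %mor item per word, and punctuation items without a | separator. The function name align_mor is hypothetical, and this is not PyCantonese’s actual parser.

```python
def align_mor(main_tier, mor_tier):
    """Pair each main-tier word with its (pos, jyutping) from the %mor tier."""
    # Drop the tier labels ('*XXA:' and '%mor:') and split on whitespace.
    words = main_tier.split(":", 1)[1].split()
    annotations = mor_tier.split(":", 1)[1].split()
    if len(words) != len(annotations):
        raise ValueError("main tier and %mor tier are not aligned")

    aligned = []
    for word, annotation in zip(words, annotations):
        if "|" in annotation:
            pos, jyutping = annotation.split("|", 1)
        else:
            # Punctuation carries no part-of-speech tag or romanization.
            pos, jyutping = "", None
        aligned.append((word, pos, jyutping))
    return aligned


main = "*XXA:\t喂 遲 啲 去 唔 去 旅行 啊 ?"
mor = "%mor:\te|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?"

aligned = align_mor(main, mor)
print(aligned[6])
# ('旅行', 'vn', 'leoi5hang4')
print([jyutping for _, _, jyutping in aligned])
# ['wai3', 'ci4', 'di1', 'heoi3', 'm4', 'heoi3', 'leoi5hang4', 'aa3', None]
```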

For further processing of Jyutping romanization, please see the Jyutping Romanization page.

Chinese Characters

Corpus data in the CHAT format is word-segmented, and the same word segmentation is preserved in the output of the CHAT methods words(), tokens(), and utterances(). For Cantonese data, a (segmented) word can be, say, 廣東話 (“Cantonese”) with three Chinese characters. To work with data at the character level, characters() is available:

corpus.characters(by_utterance=True)[0]
# ['喂', '遲', '啲', '去', '唔', '去', '旅', '行', '啊', '?']
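The step from segmented words to characters can be reproduced with a short standalone sketch. It only illustrates the idea and is not PyCantonese’s implementation; the function name to_characters is hypothetical. Python 3 strings iterate by Unicode code point, so a character outside the Basic Multilingual Plane such as 𡃉 still counts as a single character.

```python
def to_characters(words):
    """Flatten a list of segmented words into a list of single characters."""
    return [char for word in words for char in word]


words = ["喂", "遲", "啲", "去", "唔", "去", "旅行", "啊", "?"]
print(to_characters(words))
# ['喂', '遲', '啲', '去', '唔', '去', '旅', '行', '啊', '?']
```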

If you independently have Cantonese data in Chinese characters, PyCantonese has tools for word segmentation and part-of-speech tagging.

Word Ngrams

To count word n-grams of any size, use the method word_ngrams():

trigrams = corpus.word_ngrams(3)  # Trigrams
trigrams.most_common(10)
# [(('係', '啊', '.'), 527),
#  ((',', '誒', ','), 520),
#  (('呢', ',', '就'), 219),
#  (('係', '啊', ','), 209),
#  (('係', '囖', '.'), 202),
#  (('吖', '嗎', '.'), 202),
#  (('𡃉', '喎', '.'), 186),
#  (('𠺢', '嗎', '.'), 167),
#  (('係', '喇', '.'), 140),
#  (('係', '喇', ','), 134)]
word_freq = corpus.word_ngrams(1)  # Note that unigrams are also represented as tuples.
word_freq.most_common(10)
# [(('.',), 13251),
#  ((',',), 9282),
#  (('係',), 5019),
#  (('啊',), 4110),
#  (('?',), 2911),
#  (('我',), 2755),
#  (('噉',), 2741),
#  (('呢',), 2734),
#  (('你',), 2570),
#  (('佢',), 2259)]
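The tuple-keyed counts shown above can be approximated with collections.Counter, as in the standalone sketch below. It is an illustration rather than PyCantonese’s implementation. The function name count_ngrams is hypothetical, and counting n-grams within each utterance (never across utterance boundaries) is an assumption of this sketch, not a documented property of word_ngrams().

```python
from collections import Counter


def count_ngrams(utterances, n):
    """Count word n-grams as tuples, within each utterance separately."""
    counts = Counter()
    for words in utterances:
        # An utterance shorter than n yields an empty range and no n-grams.
        for start in range(len(words) - n + 1):
            counts[tuple(words[start:start + n])] += 1
    return counts


utterances = [
    ["喂", "遲", "啲", "去", "唔", "去", "旅行", "啊", "?"],
    ["你", "老公", "有冇", "平", "機票", "啊", "?"],
]

bigrams = count_ngrams(utterances, 2)
print(bigrams[("啊", "?")])
# 2

# Unigram keys are 1-tuples; unpack them to look words up by plain string.
unigrams = count_ngrams(utterances, 1)
unigram_words = Counter({gram[0]: count for gram, count in unigrams.items()})
print(unigram_words["去"])
# 2
```

Keeping unigrams as 1-tuples means every n-gram size shares the same key type, at the cost of the extra unpacking step when plain word frequencies are wanted.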