Part-of-Speech Tagging
A basic part-of-speech tagger is provided by pos_tag(),
which takes a segmented phrase or sentence as the input:
import pycantonese
unsegmented = '我噚日買對鞋。' # I bought a pair of shoes yesterday.
segmented = pycantonese.segment(unsegmented)
segmented
# ['我', '噚日', '買', '對鞋', '。']
pycantonese.pos_tag(segmented)
# [('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('對鞋', 'NOUN'), ('。', 'PUNCT')]
The part-of-speech tagger uses an averaged perceptron model trained on
HKCanCor data.
HKCanCor has already been annotated for part-of-speech tags,
with a tagset of over 100 tags
(46 of which are described).
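For readers unfamiliar with the averaged perceptron mentioned above, the following is a minimal, self-contained sketch of the general technique. It is not PyCantonese's implementation: the class name, the feature set, and the single training sentence (taken from the example output earlier in this section) are all assumptions made for illustration.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Toy multiclass averaged perceptron (hypothetical, for illustration)."""

    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weights = defaultdict(lambda: defaultdict(float))  # feature -> tag -> weight
        self._totals = defaultdict(float)  # (feature, tag) -> accumulated weight
        self._stamps = defaultdict(int)    # (feature, tag) -> step of last change
        self._step = 0

    def predict(self, feats):
        scores = {tag: 0.0 for tag in self.classes}
        for feat in feats:
            for tag, weight in self.weights.get(feat, {}).items():
                scores[tag] += weight
        # Ties are broken deterministically by tag name.
        return max(self.classes, key=lambda tag: (scores[tag], tag))

    def update(self, truth, guess, feats):
        self._step += 1
        if truth == guess:
            return
        for feat in feats:
            for tag, delta in ((truth, 1.0), (guess, -1.0)):
                key = (feat, tag)
                # Credit the old weight for the steps it stayed unchanged.
                self._totals[key] += (self._step - self._stamps[key]) * self.weights[feat][tag]
                self._stamps[key] = self._step
                self.weights[feat][tag] += delta

    def average(self):
        # Replace each weight by its average over all training steps.
        for feat, tag_weights in self.weights.items():
            for tag, weight in tag_weights.items():
                key = (feat, tag)
                total = self._totals[key] + (self._step - self._stamps[key]) * weight
                tag_weights[tag] = total / self._step if self._step else 0.0


def features(words, i, prev_tag):
    # A deliberately tiny, assumed feature set; real taggers use many more.
    word = words[i]
    prev_word = words[i - 1] if i > 0 else "<s>"
    return ["bias", f"word={word}", f"last_char={word[-1]}",
            f"prev_tag={prev_tag}", f"prev_word={prev_word}"]


def train(model, sentences, epochs=10):
    for _ in range(epochs):
        for sentence in sentences:
            words = [word for word, _ in sentence]
            prev_tag = "<s>"
            for i, (_, truth) in enumerate(sentence):
                feats = features(words, i, prev_tag)
                guess = model.predict(feats)
                model.update(truth, guess, feats)
                prev_tag = guess
    model.average()


def tag_sentence(model, words):
    tagged, prev_tag = [], "<s>"
    for i, word in enumerate(words):
        prev_tag = model.predict(features(words, i, prev_tag))
        tagged.append((word, prev_tag))
    return tagged


training = [[('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('對鞋', 'NOUN'), ('。', 'PUNCT')]]
model = AveragedPerceptron({tag for sentence in training for _, tag in sentence})
train(model, training)
words = [word for word, _ in training[0]]
print(tag_sentence(model, words))
# [('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('對鞋', 'NOUN'), ('。', 'PUNCT')]
```

Averaging the weights over all training steps, rather than keeping only the final weights, is what makes the model less sensitive to the order of the training examples.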
By default, pos_tag() maps the HKCanCor tagset to the
Universal Dependencies v2 tagset
(with 17 tags),
for cross-linguistic natural language processing work.
If you would like the original HKCanCor tagset,
pos_tag() accepts the keyword argument tagset:
pycantonese.pos_tag(segmented, tagset="hkcancor")
# [('我', 'r'), ('噚日', 't'), ('買', 'v'), ('對鞋', 'n'), ('。', '。')]
The helper function hkcancor_to_ud()
exposes the tagset mapping from HKCanCor to Universal Dependencies.
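To make the relationship between the two outputs above concrete, here is a small sketch of a tag-to-tag lookup. The dictionary is an illustrative subset read off the two example outputs in this section, not PyCantonese's full mapping table, and the name map_tags and the fallback to the Universal Dependencies tag X for unlisted tags are assumptions.

```python
# Illustrative subset only: these five pairs come from the example outputs
# in this section, not from PyCantonese's actual mapping table.
EXAMPLE_HKCANCOR_TO_UD = {
    "r": "PRON",
    "t": "ADV",
    "v": "VERB",
    "n": "NOUN",
    "。": "PUNCT",
}


def map_tags(tagged_words, mapping, unknown="X"):
    """Translate (word, tag) pairs through a mapping (hypothetical helper)."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_words]


hkcancor_tagged = [('我', 'r'), ('噚日', 't'), ('買', 'v'), ('對鞋', 'n'), ('。', '。')]
print(map_tags(hkcancor_tagged, EXAMPLE_HKCANCOR_TO_UD))
# [('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('對鞋', 'NOUN'), ('。', 'PUNCT')]
```

For real work, use hkcancor_to_ud() rather than a hand-written table like this one.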
Due to the statistical nature of part-of-speech tagging,
the quality of results from pos_tag() depends on
(i) the training data, and
(ii) the quality of word segmentation, since the function expects segmented input.
Currently, a major limitation is the fact that HKCanCor is perhaps still
the only Cantonese corpus with a permissive license that comes annotated
with part-of-speech tags.
Its relatively small size (about 150,000 tagged words) means that models
more sophisticated than a standard averaged perceptron are unlikely to justify their added complexity.
If the results from pos_tag() look odd,
the cause may be the HKCanCor training data
(e.g., specific occurrences of word + tag combinations might have thrown off the tagger)
or the quality of word segmentation, especially if your segmented input comes from
segment().
Please get in touch
if you would like further investigation.