pycantonese.pos_tag

pycantonese.pos_tag(words, tagset='universal')

Tag the words for their parts of speech.

The part-of-speech tagger uses an averaged perceptron model trained on the HKCanCor data.

Added in version 3.1.0.

Parameters:

words (list[str]) – A segmented sentence or phrase, where each word is a string of Cantonese characters.
tagset (str, {"universal", "hkcancor"}) –
The part-of-speech tagset that the returned tags are in. Supported options:
- "hkcancor", for the tagset used by the original HKCanCor data. There are over 100 tags, 46 of which are described at https://github.com/fcbond/hkcancor.
- "universal" (default option), for the Universal Dependencies v2 tagset. There are 17 tags; see https://universaldependencies.org/u/pos/index.html. Internally, this option applies hkcancor_to_ud() to convert HKCanCor tags to UD tags.

Returns:

The segmented sentence/phrase where each word is paired with its predicted POS tag.

Return type:

list[tuple[str, str]]

Raises:

TypeError – If the input is a string (e.g., an unsegmented string of Cantonese).
ValueError – If the tagset argument is not one of the allowed options from {"universal", "hkcancor"}.
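
The two documented error conditions can be checked before the tagger is called. The following is a minimal sketch, not pycantonese's own code: check_pos_tag_args and ALLOWED_TAGSETS are hypothetical names introduced here for illustration, and the library's actual checks and error messages may differ.

```python
# Hypothetical pre-check that mirrors the errors documented under "Raises".
# This is NOT pycantonese's implementation; the names are illustrative only.
ALLOWED_TAGSETS = {"universal", "hkcancor"}


def check_pos_tag_args(words, tagset="universal"):
    """Raise the documented error type if the arguments look invalid."""
    # A bare string is most likely an unsegmented sentence, not a list of words.
    if isinstance(words, str):
        raise TypeError(
            "words must be a list of segmented words, not a single string"
        )
    # Only the two documented tagset names are accepted.
    if tagset not in ALLOWED_TAGSETS:
        raise ValueError(
            f"tagset must be one of {sorted(ALLOWED_TAGSETS)}, got {tagset!r}"
        )
```

Calling such a check first lets a caller report a clearer message (for example, a reminder to segment the text) before passing the input on to pos_tag().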

Examples

>>> words = ['我', '噚日', '買', '嗰', '對', '鞋', '。']  # I bought that pair of shoes yesterday.
>>> pos_tag(words)
[('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('嗰', 'PRON'), ('對', 'NOUN'), ('鞋', 'NOUN'), ('。', 'PUNCT')]
>>> pos_tag(words, tagset="hkcancor")
[('我', 'r'), ('噚日', 't'), ('買', 'v'), ('嗰', 'r'), ('對', 'q'), ('鞋', 'n'), ('。', '。')]
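
Because the return value is a plain list of (word, tag) tuples, it can be processed with ordinary Python. The sketch below reuses the two outputs shown above as literal data, so it runs without pycantonese installed. The variable names are illustrative, and observed_mapping reflects only these seven tokens, not the full conversion performed by hkcancor_to_ud().

```python
# Outputs copied from the examples above (not recomputed here).
ud_tagged = [('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('嗰', 'PRON'),
             ('對', 'NOUN'), ('鞋', 'NOUN'), ('。', 'PUNCT')]
hkcancor_tagged = [('我', 'r'), ('噚日', 't'), ('買', 'v'), ('嗰', 'r'),
                   ('對', 'q'), ('鞋', 'n'), ('。', '。')]

# Keep only the words tagged as nouns under the universal tagset.
nouns = [word for word, tag in ud_tagged if tag == 'NOUN']

# Pair each HKCanCor tag with the UD tag predicted for the same word.
# Both lists cover the same words in the same order, so zip() aligns them.
observed_mapping = {
    hk_tag: ud_tag
    for (_, hk_tag), (_, ud_tag) in zip(hkcancor_tagged, ud_tagged)
}

print(nouns)
print(observed_mapping)
```

In this sentence the HKCanCor tags 'q' and 'n' both correspond to the UD tag 'NOUN', which shows how the universal tagset merges distinctions that the HKCanCor tagset keeps separate.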