pycantonese.pos_tag

pycantonese.pos_tag(words, tagset='universal')

Tag the words for their parts of speech.

The part-of-speech tagger uses an averaged perceptron model trained on the HKCanCor data.

Added in version 3.1.0.

Parameters:
  • words (list[str]) – A segmented sentence or phrase, where each word is a string of Cantonese characters.

  • tagset (str, {"universal", "hkcancor"}) –

    The part-of-speech tagset that the returned tags are in. Supported options: "universal" (the default) and "hkcancor".

Returns:

The segmented sentence/phrase where each word is paired with its predicted POS tag.

Return type:

list[tuple[str, str]]

Raises:
  • TypeError – If the input is a string (e.g., an unsegmented string of Cantonese).

  • ValueError – If the tagset argument is not one of the allowed options from {"universal", "hkcancor"}.
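
The two documented error conditions can be checked before the tagger is called. The following is a minimal sketch of such a pre-check, not pycantonese's own implementation; `check_pos_tag_args` and `ALLOWED_TAGSETS` are hypothetical names introduced here for illustration.

```python
# Assumption: the allowed options are exactly the two named in the signature.
ALLOWED_TAGSETS = {"universal", "hkcancor"}


def check_pos_tag_args(words, tagset="universal"):
    """Hypothetical pre-check mirroring the documented TypeError/ValueError."""
    if isinstance(words, str):
        # An unsegmented string must be split into a list of words first.
        raise TypeError("words must be a list of strings, not a single string")
    if tagset not in ALLOWED_TAGSETS:
        raise ValueError(
            f"tagset must be one of {sorted(ALLOWED_TAGSETS)}, got {tagset!r}"
        )
    return list(words)
```

Running a check like this ahead of time lets calling code report a bad argument in its own terms instead of catching the exception raised by the tagger.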

Examples

>>> from pycantonese import pos_tag
>>> words = ['我', '噚日', '買', '嗰', '對', '鞋', '。']  # I bought that pair of shoes yesterday.
>>> pos_tag(words)
[('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('嗰', 'PRON'), ('對', 'NOUN'), ('鞋', 'NOUN'), ('。', 'PUNCT')]
>>> pos_tag(words, tagset="hkcancor")
[('我', 'r'), ('噚日', 't'), ('買', 'v'), ('嗰', 'r'), ('對', 'q'), ('鞋', 'n'), ('。', '。')]