Stop Words

Many natural language processing tasks require filtering out stop words. In English, these include function words such as pronouns and determiners. PyCantonese provides the function stop_words(), which returns a set of about 100 Cantonese stop words:

import pycantonese
stop_words = pycantonese.stop_words()
len(stop_words)
# 104
stop_words
# {'一啲', '一定', '不如', '不過', ...}
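
As a minimal sketch of the filtering step itself, assuming the text has already been segmented into words: the token list below is made up for illustration, and the four-word set is only a stand-in for the full set returned by pycantonese.stop_words(), so the snippet runs without the library installed.

```python
# Stand-in for pycantonese.stop_words(); these four words are taken
# from the default set shown above.
stop_words = {'一啲', '一定', '不如', '不過'}

# Hypothetical, already-segmented sentence (illustration only).
tokens = ['不如', '我哋', '去', '九龍', '食飯']

# Keep only the tokens that are not stop words, preserving order.
content_words = [token for token in tokens if token not in stop_words]
print(content_words)
# ['我哋', '去', '九龍', '食飯']
```

Because the stop words are held in a set, each membership check is a constant-time lookup on average, which matters when filtering a large corpus.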

Depending on your use case, you may want to add words to, or remove words from, the default set. For this purpose, the stop_words() function accepts two optional arguments, add and remove.

add can be either a string (e.g., to treat "香港" as a stop word if your data is all about Hong Kong) or an iterable of strings:

import pycantonese
stop_words_1 = pycantonese.stop_words(add='香港')
len(stop_words_1)
# 105
'香港' in stop_words_1
# True
stop_words_2 = pycantonese.stop_words(add=['香港島', '九龍', '新界'])  # Hong Kong Island, Kowloon, the New Territories
len(stop_words_2)
# 107
{'香港島', '九龍', '新界'}.issubset(stop_words_2)
# True

Similarly, the remove argument takes either a string or an iterable of strings.
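
The sketch below mimics the string-or-iterable handling of remove on a small stand-in set, so it runs without PyCantonese installed. The helper drop_words and the set base are hypothetical names introduced for illustration; this is not PyCantonese's implementation. With the library itself, the corresponding calls would be pycantonese.stop_words(remove='一啲') and pycantonese.stop_words(remove=['一定', '不如']).

```python
def drop_words(base, remove=None):
    """Return a copy of `base` without the words in `remove`.

    Hypothetical helper for illustration; `remove` may be None,
    a single string, or an iterable of strings.
    """
    if remove is None:
        return set(base)
    # A bare string counts as one word, not as an iterable of characters.
    if isinstance(remove, str):
        remove = [remove]
    return set(base) - set(remove)


# Stand-in for the default stop word set (four words shown earlier).
base = {'一啲', '一定', '不如', '不過'}

one_removed = drop_words(base, remove='一啲')
print(len(one_removed))
# 3
print('一啲' in one_removed)
# False

two_removed = drop_words(base, remove=['一定', '不如'])
print(len(two_removed))
# 2
```

The isinstance check is the important detail: without it, a two-character string such as '一啲' would be split into the single characters '一' and '啲', and the intended word would stay in the set.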