Stop Words

Many natural language processing tasks require filtering out stop words, that is, very frequent words that carry little content of their own. English examples include function words such as pronouns and determiners. PyCantonese provides the function stop_words(), which returns a set of about 100 Cantonese stop words:

import pycantonese
stop_words = pycantonese.stop_words()
len(stop_words)
# 104
stop_words
# {'一啲', '一定', '不如', '不過', ...}
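Once you have the set, filtering is a plain membership test on each token. The following is a minimal sketch rather than part of the PyCantonese API: the token list is hypothetical, already word-segmented data, and a small hand-picked set stands in for pycantonese.stop_words() so that the snippet runs without the library installed.

```python
# Stand-in for pycantonese.stop_words(): a tiny hand-picked set so that this
# sketch runs without the library installed. The four words are the ones
# visible in the sample output above; the real set has about 100 entries.
stop_words = {"一啲", "一定", "不如", "不過"}

# Hypothetical, already word-segmented utterance (illustrative data only).
tokens = ["不如", "食", "一啲", "點心"]

# Keep only the tokens that are not stop words, preserving their order.
content_tokens = [token for token in tokens if token not in stop_words]
print(content_tokens)
# ['食', '點心']
```

In real use, replace the stand-in set with the return value of pycantonese.stop_words(). Because that return value is a set, each membership test takes constant time on average.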

Depending on your use case, you may want to add stop words to the default set or remove some from it. The stop_words() function takes the optional arguments add and remove for this purpose.

add can be either a string (e.g., treat "香港" as a stop word if your data is all about Hong Kong) or an iterable of strings:

import pycantonese
stop_words_1 = pycantonese.stop_words(add='香港')
len(stop_words_1)
# 105
'香港' in stop_words_1
# True
stop_words_2 = pycantonese.stop_words(add=['香港島', '九龍', '新界'])  # Hong Kong Island, Kowloon, the New Territories
len(stop_words_2)
# 107
{'香港島', '九龍', '新界'}.issubset(stop_words_2)
# True

Similarly, the remove argument takes either a string or an iterable of strings.