Stop Words
Many natural language processing tasks require filtering out stop words,
that is, very frequent words that carry little content of their own. In
English, these are typically function words such as pronouns and
determiners. PyCantonese provides the function stop_words(), which
returns a set of about 100 Cantonese stop words:
import pycantonese
stop_words = pycantonese.stop_words()
len(stop_words)
# 104
stop_words # doctest: +SKIP
# {'一啲', '一定', '不如', '不過', ...}
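Because the result is a plain Python set, filtering comes down to a membership test per token. The following is a minimal sketch, assuming the text has already been segmented into words. The token list is hypothetical, and sample_stop_words is a four-word stand-in built from the words visible in the output above, not the full default set:

```python
# Stand-in for pycantonese.stop_words(): only the four words shown above.
sample_stop_words = {'一啲', '一定', '不如', '不過'}

# Hypothetical pre-segmented sentence ("I will definitely go to Hong Kong").
tokens = ['我', '一定', '會', '去', '香港']

# Keep only the tokens that are not in the stop word set.
filtered = [token for token in tokens if token not in sample_stop_words]
filtered
# ['我', '會', '去', '香港']
```

With the full default set in place of the stand-in, more of these tokens may be dropped.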
Depending on your use case, you may want to add words to, or remove words
from, the default set.
The stop_words() function takes two optional arguments,
add and remove.
add can be either a string (e.g., treat "香港" as a stop word if your
data is all about Hong Kong) or an iterable of strings:
import pycantonese
stop_words_1 = pycantonese.stop_words(add='香港')
len(stop_words_1)
# 105
'香港' in stop_words_1
# True
stop_words_2 = pycantonese.stop_words(add=['香港島', '九龍', '新界']) # Hong Kong Island, Kowloon, the New Territories
len(stop_words_2)
# 107
{'香港島', '九龍', '新界'}.issubset(stop_words_2)
# True
Similarly, the remove argument takes either a string or an iterable
of strings.