Grapheme-to-Phoneme Conversion

Grapheme-to-phoneme (G2P) conversion maps written characters to the phonemes they represent. It is a foundational task in linguistics and in speech and NLP applications such as text-to-speech, pronunciation modeling, and phonological analysis. For Cantonese, that means going from Chinese characters directly to IPA (the International Phonetic Alphabet).

The g2p() function provides a one-call grapheme-to-phoneme pipeline for Cantonese, composing characters_to_jyutping() and jyutping_to_ipa():

import pycantonese
pycantonese.g2p('香港人講廣東話。')  # Hongkongers speak Cantonese.
# [('香港人', ['hœŋ55', 'kɔŋ25', 'jɐn21']),
#  ('講', ['kɔŋ25']),
#  ('廣東話', ['kʷɔŋ25', 'tʊŋ55', 'waː25']),
#  ('。', None)]

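The output is a list of segmented words. Each word is a 2-tuple of (Cantonese characters, list of IPA syllables), and the IPA list contains one IPA string per character in the word. A token without a pronunciation, such as the full stop in the example above, carries None instead of a list.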
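
To make that shape concrete, here is a minimal sketch of two ways a caller might consume it. The helper names flatten_ipa and align_characters are hypothetical (they are not part of PyCantonese), and the sample data is copied from the example above rather than produced by a live call.

```python
# Sample data in the shape g2p() returns, copied from the example above.
result = [
    ("香港人", ["hœŋ55", "kɔŋ25", "jɐn21"]),
    ("講", ["kɔŋ25"]),
    ("廣東話", ["kʷɔŋ25", "tʊŋ55", "waː25"]),
    ("。", None),
]


def flatten_ipa(words):
    """Collect every IPA syllable in order, skipping words whose IPA is None."""
    return [syllable for _word, ipa in words if ipa is not None for syllable in ipa]


def align_characters(words):
    """Pair each character with its IPA syllable, or with None if the word has no IPA."""
    pairs = []
    for word, ipa in words:
        if ipa is None:
            pairs.extend((char, None) for char in word)
        elif len(word) != len(ipa):
            # Guard the "one IPA string per character" assumption explicitly.
            raise ValueError(f"{word!r}: {len(word)} characters but {len(ipa)} syllables")
        else:
            pairs.extend(zip(word, ipa))
    return pairs


utterance = " ".join(flatten_ipa(result))  # one space-joined IPA string
aligned = align_characters(result)         # [("香", "hœŋ55"), ("港", "kɔŋ25"), ...]
```

The length check in align_characters is a defensive choice for this sketch: it fails loudly if a word and its IPA list ever disagree in length, rather than silently truncating the pairing.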

When g2p() receives a raw string, it runs word segmentation internally (via segment()), so that each character's pronunciation is disambiguated by the word it appears in:

import pycantonese
# 蛋 alone is pronounced with tone 2 (high-rising).
pycantonese.g2p('蛋')  # egg
# [('蛋', ['taːn25'])]

# 蛋 in 蛋糕 is pronounced with tone 6 (low-level).
pycantonese.g2p('蛋糕')  # cake
# [('蛋糕', ['taːn22', 'kou55'])]

If you would rather supply your own segmentation, pass a list of words instead of a string:

import pycantonese
pycantonese.g2p(['廣東', '話'])  # Cantonese
# [('廣東', ['kʷɔŋ25', 'tʊŋ55']), ('話', ['waː22'])]

Any word with no Jyutping mapping — an unseen Cantonese character, a punctuation mark, or a non-Chinese token — yields None in place of the IPA list, so callers can detect and handle out-of-vocabulary items explicitly:

import pycantonese
pycantonese.g2p('佢成日呃like')
# [('佢', ['kʰɵy23']),
#  ('成日', ['sɪŋ21', 'jɐt̚22']),
#  ('呃', ['ŋaːk̚55']),
#  ('like', None)]
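
One way a caller might act on those None values is sketched below. The helpers split_oov and render are hypothetical names, not PyCantonese functions, and the sample data is copied from the example above so the sketch runs without the library.

```python
# Sample data in the shape g2p() returns, copied from the example above.
result = [
    ("佢", ["kʰɵy23"]),
    ("成日", ["sɪŋ21", "jɐt̚22"]),
    ("呃", ["ŋaːk̚55"]),
    ("like", None),
]


def split_oov(words):
    """Separate words that have IPA from tokens whose IPA is None."""
    known, unknown = [], []
    for token, ipa in words:
        if ipa is None:
            unknown.append(token)
        else:
            known.append((token, ipa))
    return known, unknown


def render(words, unknown_format="<{}>"):
    """Join syllables with '.', words with spaces, and mark tokens lacking IPA."""
    parts = []
    for token, ipa in words:
        if ipa is None:
            parts.append(unknown_format.format(token))
        else:
            parts.append(".".join(ipa))
    return " ".join(parts)


known, unknown = split_oov(result)  # unknown == ["like"]
line = render(result)               # the out-of-vocabulary token appears as "<like>"
```

Testing with `ipa is None` rather than truthiness keeps the check tied to the documented sentinel, so an empty list (if one ever appeared) would not be mistaken for an out-of-vocabulary item.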

The IPA mapping inherits the choices of jyutping_to_ipa(), which follows Matthews and Yip (2011: 461-463). To customize specific IPA symbols, g2p() accepts the keyword arguments onsets, nuclei, and codas, each a dictionary that maps a Jyutping sound to your desired IPA symbol:

import pycantonese
pycantonese.g2p('我')
# [('我', ['ŋɔ23'])]
pycantonese.g2p('我', onsets={'ng': 'n'})
# [('我', ['nɔ23'])]
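
Conceptually, an override dictionary replaces only the entries it names and leaves every other symbol at its default. The sketch below illustrates that idea with a tiny hypothetical onset table. It is not the library's internal data or code: DEFAULT_ONSETS and resolve_onsets are made-up names, and the three entries are simply the onset symbols visible in the outputs in this section.

```python
# Hypothetical subset of a Jyutping-onset-to-IPA table (not PyCantonese's actual data).
DEFAULT_ONSETS = {"ng": "ŋ", "g": "k", "gw": "kʷ"}


def resolve_onsets(overrides=None):
    """Return the default table with any caller-supplied entries replaced."""
    merged = dict(DEFAULT_ONSETS)   # copy, so the defaults are never mutated
    merged.update(overrides or {})  # caller entries win over defaults
    return merged


default_table = resolve_onsets()
custom_table = resolve_onsets({"ng": "n"})
# custom_table["ng"] is "n", while "g" and "gw" keep their default symbols.
```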

For lower-level access, for example to get the intermediate Jyutping output, to control tones, or to produce a single space-joined IPA string, call characters_to_jyutping() and jyutping_to_ipa() directly. See Jyutping Romanization for details.
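
As a rough illustration of how the two lower-level steps compose, the sketch below wires them together through injected callables. It assumes the first callable returns (word, Jyutping-or-None) pairs and the second returns a list of IPA syllables for a Jyutping string. The function g2p_sketch and the two stand-in functions are hypothetical and exist only so the example runs without PyCantonese installed. This is not the library's actual implementation.

```python
def g2p_sketch(text, to_jyutping, to_ipa):
    """Compose a characters-to-Jyutping step with a Jyutping-to-IPA step."""
    output = []
    for word, jyutping in to_jyutping(text):
        # Words with no Jyutping mapping stay None instead of being converted.
        ipa = to_ipa(jyutping) if jyutping is not None else None
        output.append((word, ipa))
    return output


# Stand-ins with hard-coded answers, used here in place of the real functions.
def fake_to_jyutping(text):
    return [("廣東話", "gwong2dung1waa2"), ("。", None)]


def fake_to_ipa(jyutping):
    return {"gwong2dung1waa2": ["kʷɔŋ25", "tʊŋ55", "waː25"]}[jyutping]


demo = g2p_sketch("廣東話。", fake_to_jyutping, fake_to_ipa)
# demo == [("廣東話", ["kʷɔŋ25", "tʊŋ55", "waː25"]), ("。", None)]
```

With PyCantonese available, one would presumably pass pycantonese.characters_to_jyutping and pycantonese.jyutping_to_ipa as the two callables. Keeping the Jyutping step separate is what exposes the intermediate romanization.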