Grapheme-to-Phoneme Conversion

Grapheme-to-phoneme (G2P) conversion — mapping written characters to the phonemes that pronounce them — is a foundational task in linguistics and speech/NLP applications such as text-to-speech, pronunciation modeling, and phonological analysis. For Cantonese, that means going from Chinese characters straight to IPA (the International Phonetic Alphabet).

The g2p() function provides a one-call grapheme-to-phoneme pipeline for Cantonese, composing characters_to_jyutping() and jyutping_to_ipa():

import pycantonese
pycantonese.g2p('香港人講廣東話。')  # Hongkongers speak Cantonese.
# [('香港人', 'hœŋ55 kɔŋ25 jɐn21'),
#  ('講', 'kɔŋ25'),
#  ('廣東話', 'kʷɔŋ25 tʊŋ55 waː25'),
#  ('。', None)]

The output is a list with one 2-tuple per segmented word: (Cantonese characters, IPA string). Within the IPA string, syllables are separated by a single space. A token with no pronunciation, such as the full stop in the example above, gets None in place of the IPA string.
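Because syllables are space-separated, the IPA string can be split into per-syllable lists with ordinary string handling. The following is a minimal sketch rather than part of the library: split_syllables is a hypothetical helper name, and the sample data is copied by hand from the example output above, so the block runs without calling g2p().

```python
from typing import List, Optional, Tuple

# Shape of the list returned by g2p(), as described in this section.
G2PResult = List[Tuple[str, Optional[str]]]


def split_syllables(result: G2PResult) -> List[Tuple[str, List[str]]]:
    """Hypothetical helper: split each IPA string on single spaces.

    Words whose IPA is None map to an empty syllable list.
    """
    return [
        (word, ipa.split(" ") if ipa is not None else [])
        for word, ipa in result
    ]


# Sample data copied from the documented example output, not computed here.
sample: G2PResult = [
    ("香港人", "hœŋ55 kɔŋ25 jɐn21"),
    ("講", "kɔŋ25"),
    ("廣東話", "kʷɔŋ25 tʊŋ55 waː25"),
    ("。", None),
]

syllables = split_syllables(sample)
for word, parts in syllables:
    print(word, parts)
```

Mapping None to an empty list is one possible design choice; keeping None may be preferable if downstream code needs to distinguish missing pronunciations from empty ones.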

When g2p() receives a raw string, it first segments the text into words (via segment()), so that a character whose pronunciation depends on the word it appears in is rendered correctly:

import pycantonese
# 蛋 alone is pronounced with tone 2 (high-rising).
pycantonese.g2p('蛋')  # egg
# [('蛋', 'taːn25')]

# 蛋 in 蛋糕 is pronounced with tone 6 (low-level).
pycantonese.g2p('蛋糕')  # cake
# [('蛋糕', 'taːn22 kou55')]

If you would rather supply your own segmentation, pass a list of words instead of a string:

import pycantonese
pycantonese.g2p(['廣東', '話'])  # Cantonese
# [('廣東', 'kʷɔŋ25 tʊŋ55'), ('話', 'waː22')]

Any word with no Jyutping mapping — an unseen Cantonese character, a punctuation mark, or a non-Chinese token — yields None in place of the IPA string, so callers can detect and handle out-of-vocabulary items explicitly:

import pycantonese
pycantonese.g2p('佢成日呃like')
# [('佢', 'kʰɵy23'),
#  ('成日', 'sɪŋ21 jɐt̚22'),
#  ('呃', 'ŋaːk̚55'),
#  ('like', None)]
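
One way to act on the None marker is to separate covered words from out-of-vocabulary tokens before further processing. The sketch below assumes only the output shape shown above; partition_oov is a hypothetical helper name, and the sample list is copied by hand from the example rather than produced by calling g2p().

```python
from typing import List, Optional, Tuple

G2PResult = List[Tuple[str, Optional[str]]]


def partition_oov(result: G2PResult) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Hypothetical helper: split a g2p()-style result into two lists.

    Returns (known, unknown): known holds (word, ipa) pairs that have a
    pronunciation; unknown holds the words whose IPA is None.
    """
    known = [(word, ipa) for word, ipa in result if ipa is not None]
    unknown = [word for word, ipa in result if ipa is None]
    return known, unknown


# Sample data copied from the documented example output, not computed here.
sample: G2PResult = [
    ("佢", "kʰɵy23"),
    ("成日", "sɪŋ21 jɐt̚22"),
    ("呃", "ŋaːk̚55"),
    ("like", None),
]

known, unknown = partition_oov(sample)
print("with IPA:", [word for word, _ in known])
print("no IPA:", unknown)
```

The unknown list could then be logged, passed to a fallback pronunciation source, or dropped, depending on the application.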

The IPA mapping inherits the choices of jyutping_to_ipa(), which follows Matthews and Yip (2011: 461-463). To customize specific IPA symbols, g2p() accepts the keyword arguments onsets, nuclei, codas, and tones, each a dictionary that maps a Jyutping sound to your desired IPA symbol:

import pycantonese
pycantonese.g2p('我')
# [('我', 'ŋɔ23')]
pycantonese.g2p('我', onsets={'ng': 'n'})
# [('我', 'nɔ23')]

For lower-level access — for example, if you want the intermediate Jyutping output, or want to control tones — call characters_to_jyutping() and jyutping_to_ipa() directly. See Jyutping Romanization for details.
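
To illustrate the two-step composition, here is a rough sketch that chains a characters-to-Jyutping step and a Jyutping-to-IPA step passed in as plain callables. It is not the library's implementation. compose_g2p and both stand-in functions are hypothetical, and the sketch assumes that the first step returns (word, Jyutping or None) pairs and that the second returns a list of IPA syllables; check the actual return types of characters_to_jyutping() and jyutping_to_ipa() before substituting them.

```python
from typing import Callable, List, Optional, Tuple

JyutpingPairs = List[Tuple[str, Optional[str]]]


def compose_g2p(
    text: str,
    to_jyutping: Callable[[str], JyutpingPairs],
    to_ipa: Callable[[str], List[str]],
) -> List[Tuple[str, Optional[str]]]:
    """Hypothetical composition of the two steps; keeps None for unmapped words."""
    output = []
    for word, jyutping in to_jyutping(text):
        if jyutping is None:
            # No Jyutping means no IPA either.
            output.append((word, None))
        else:
            # Join the IPA syllables with single spaces, as in g2p() output.
            output.append((word, " ".join(to_ipa(jyutping))))
    return output


# Stand-ins for illustration only; these are NOT the pycantonese functions.
def fake_to_jyutping(text: str) -> JyutpingPairs:
    table = {"我": "ngo5"}
    return [(char, table.get(char)) for char in text]


def fake_to_ipa(jyutping: str) -> List[str]:
    table = {"ngo5": ["ŋɔ23"]}
    return table[jyutping]


result = compose_g2p("我!", fake_to_jyutping, fake_to_ipa)
print(result)
```

Keeping the intermediate Jyutping available inside the loop is what makes this lower-level route useful: it can be inspected, corrected, or logged before the IPA step runs.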