pycantonese.g2p

Convert Cantonese characters into IPA (grapheme-to-phoneme).

This is a one-shot grapheme-to-phoneme (G2P) helper that composes characters_to_jyutping() and jyutping_to_ipa(). The input is segmented into words (using segment() if a raw string is passed), each word is mapped to Jyutping, and each Jyutping syllable is then mapped to IPA.
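The three stages described above can be sketched without the library. This is a toy illustration only: `TOY_WORD_TO_JYUTPING`, `TOY_SYLLABLE_TO_IPA`, and `toy_g2p` are hypothetical stand-ins, not part of pycantonese, and the input is assumed to be already segmented. It does not reproduce how characters_to_jyutping() or jyutping_to_ipa() work internally.

```python
import re

# Hypothetical lookup tables standing in for the library's data.
# The IPA values are copied from the documented example output.
TOY_WORD_TO_JYUTPING = {"香港人": "hoeng1gong2jan4", "講": "gong2"}
TOY_SYLLABLE_TO_IPA = {"hoeng1": "hœŋ55", "gong2": "kɔŋ25", "jan4": "jɐn21"}


def toy_g2p(words):
    """Map pre-segmented words to (characters, IPA or None) pairs."""
    result = []
    for word in words:
        jyutping = TOY_WORD_TO_JYUTPING.get(word)
        if jyutping is None:
            # No Jyutping mapping (e.g. punctuation): keep the word, IPA is None.
            result.append((word, None))
            continue
        # Split the concatenated Jyutping string into syllables (letters + tone digit).
        syllables = re.findall(r"[a-z]+[1-6]", jyutping)
        # Map each syllable to IPA and join with single spaces.
        ipa = " ".join(TOY_SYLLABLE_TO_IPA[s] for s in syllables)
        result.append((word, ipa))
    return result


print(toy_g2p(["香港人", "講", "。"]))
```

The sketch keeps one output entry per input word, so unmapped words stay in position with None rather than being dropped.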

Parameters:

chars (str or list[str]) – The Cantonese text to convert. If a string is given, word segmentation is run on it first (by segment()) to resolve potential ambiguity in mapping characters to Jyutping. To skip word segmentation, pass a list of strings with your desired segmentation instead.
onsets (dict[str, str], optional) – Custom Jyutping-onset to IPA-symbol overrides, forwarded to jyutping_to_ipa().
nuclei (dict[str, str], optional) – Custom Jyutping-nucleus to IPA-symbol overrides, forwarded to jyutping_to_ipa().
codas (dict[str, str], optional) – Custom Jyutping-coda to IPA-symbol overrides, forwarded to jyutping_to_ipa().
tones (dict[str, str], optional) – Custom Jyutping-tone to IPA-symbol overrides, forwarded to jyutping_to_ipa().

Returns:

A list with one entry per segmented word, where each entry is a 2-tuple of (Cantonese characters, IPA string). Within the IPA string, syllables are separated by a single space. Any word with no Jyutping mapping (e.g. an unseen character or a punctuation mark) yields None in place of the IPA string.

Return type:

list[tuple[str, str | None]]

Examples

>>> from pycantonese import g2p
>>> g2p("香港人講廣東話。")  # Hongkongers speak Cantonese.
[('香港人', 'hœŋ55 kɔŋ25 jɐn21'), ('講', 'kɔŋ25'), ('廣東話', 'kʷɔŋ25 tʊŋ55 waː25'), ('。', None)]
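
Because the IPA element can be None, downstream code usually needs to check for it before splitting or joining. The following is a minimal sketch of doing so; `join_ipa` and `syllable_count` are hypothetical helper names, and `result` is the documented output above pasted in as a literal so that the snippet runs without the library.

```python
# The documented return value, copied in as literal data.
result = [
    ("香港人", "hœŋ55 kɔŋ25 jɐn21"),
    ("講", "kɔŋ25"),
    ("廣東話", "kʷɔŋ25 tʊŋ55 waː25"),
    ("。", None),
]


def join_ipa(pairs, placeholder=None):
    """Join the IPA of all words; unmapped words are skipped unless a placeholder is given."""
    parts = []
    for _chars, ipa in pairs:
        if ipa is not None:
            parts.append(ipa)
        elif placeholder is not None:
            parts.append(placeholder)
    return " ".join(parts)


def syllable_count(pairs):
    """Count IPA syllables, relying on the single-space separator within each word."""
    return sum(len(ipa.split(" ")) for _chars, ipa in pairs if ipa is not None)


print(join_ipa(result))
print(syllable_count(result))
```

Passing a `placeholder` such as `"?"` keeps the position of unmapped words visible in the joined string instead of silently dropping them.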