pycantonese.g2p

Convert Cantonese characters into IPA (grapheme-to-phoneme).

This is a one-shot grapheme-to-phoneme (G2P) helper that composes characters_to_jyutping() and jyutping_to_ipa(). The input is segmented into words (using segment() if a raw string is passed), each word is mapped to Jyutping, and each Jyutping syllable is then mapped to IPA.
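The pipeline described above can be illustrated with a minimal, self-contained sketch. It does not use pycantonese and is not the library's implementation: the two lookup tables and the name toy_g2p are hypothetical stand-ins for characters_to_jyutping() and jyutping_to_ipa(), and the input is assumed to be already segmented into words.

```python
import re

# Hypothetical toy tables, NOT pycantonese's data. They stand in for
# characters_to_jyutping() and jyutping_to_ipa() respectively.
TOY_WORD_TO_JYUTPING = {"香港人": "hoeng1gong2jan4", "講": "gong2"}
TOY_SYLLABLE_TO_IPA = {"hoeng1": "hœŋ55", "gong2": "kɔŋ25", "jan4": "jɐn21"}


def toy_g2p(words):
    """Map pre-segmented words to (word, IPA list or None) pairs."""
    result = []
    for word in words:
        jyutping = TOY_WORD_TO_JYUTPING.get(word)
        if jyutping is None:
            # No Jyutping mapping (e.g. punctuation): pair the word with None.
            result.append((word, None))
            continue
        # Split the concatenated Jyutping into syllables: letters + tone digit.
        syllables = re.findall(r"[a-z]+[1-6]", jyutping)
        result.append((word, [TOY_SYLLABLE_TO_IPA[s] for s in syllables]))
    return result


toy_output = toy_g2p(["香港人", "講", "。"])
```

The sketch keeps the two mapping steps separate so that the word-to-Jyutping lookup and the Jyutping-to-IPA lookup can fail or be customised independently, which mirrors the composition described in the text.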

Parameters:

chars (str or list[str]) – The Cantonese characters to convert. If a string is given, it is first word-segmented by segment(), which helps resolve ambiguity in mapping characters to Jyutping. To skip word segmentation, pass a list of strings that already reflects your desired segmentation.
onsets (dict[str, str], optional) – Custom Jyutping-onset to IPA-symbol overrides, forwarded to jyutping_to_ipa().
nuclei (dict[str, str], optional) – Custom Jyutping-nucleus to IPA-symbol overrides, forwarded to jyutping_to_ipa().
codas (dict[str, str], optional) – Custom Jyutping-coda to IPA-symbol overrides, forwarded to jyutping_to_ipa().

Returns:

A list of segmented words, where each word is a 2-tuple of (Cantonese characters, list of IPA syllables). The IPA list contains one IPA string per syllable, which normally means one per character of the word. A word with no Jyutping mapping (e.g. an unseen character or a punctuation mark) has None in place of the IPA list.

Return type:

list[tuple[str, list[str] | None]]

Examples

>>> from pycantonese import g2p
>>> g2p("香港人講廣東話。")  # Hongkongers speak Cantonese.
[('香港人', ['hœŋ55', 'kɔŋ25', 'jɐn21']), ('講', ['kɔŋ25']), ('廣東話', ['kʷɔŋ25', 'tʊŋ55', 'waː25']), ('。', None)]
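
Because unmapped words carry None instead of an IPA list, code that consumes the result needs to handle both cases. The following is a minimal sketch of one way to do that; the helper name ipa_transcription is hypothetical and not part of pycantonese, and the input is the documented output above copied in as a literal so the snippet runs without the library.

```python
# The documented g2p() output from the example above, copied as a literal.
result = [
    ("香港人", ["hœŋ55", "kɔŋ25", "jɐn21"]),
    ("講", ["kɔŋ25"]),
    ("廣東話", ["kʷɔŋ25", "tʊŋ55", "waː25"]),
    ("。", None),
]


def ipa_transcription(words):
    """Join syllables with '.' and words with spaces (hypothetical helper)."""
    parts = []
    for chars, ipa in words:
        if ipa is None:
            # No IPA available: keep the original characters (e.g. punctuation).
            parts.append(chars)
        else:
            parts.append(".".join(ipa))
    return " ".join(parts)


transcription = ipa_transcription(result)
print(transcription)
```

Falling back to the original characters is only one choice; skipping such words or raising an error may suit other applications better.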