Converting Chinese characters to pinyin is useful for tasks such as batch phonetic annotation, text sorting, and pinyin-based search.
There are many pinyin conversion tools on the Internet, as well as many open-source Python modules. Today, I will introduce one of the most feature-rich of them: pypinyin, which supports the following features:
- 1. Picks the most appropriate pinyin based on the phrase a character appears in.
- 2. Supports polyphonic characters (heteronyms).
- 3. Basic support for Traditional Chinese, plus Zhuyin (Bopomofo) support.
- 4. Supports a variety of pinyin and Zhuyin output styles.
- 5. Includes a command-line tool for one-step conversion.
1. Preparation
Before starting, you should ensure that Python and pip have been successfully installed on your computer. If not, please install them first.
(Optional 1) If you use Python for data analysis, you can install Anaconda directly: it comes with Python and pip built in.
(Optional 2) In addition, it is recommended that you use the VSCode editor, which has many advantages.
Choose one of the following ways to open a command prompt, then enter the command below to install the dependency:
- 1. On Windows, open Cmd (Start, Run, type CMD).
- 2. On macOS, open Terminal (Command + Space, type Terminal).
- 3. If you use the VSCode editor or PyCharm, use the Terminal panel at the bottom of the window.
pip install pypinyin
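To confirm that the install worked, one option is to ask Python for the installed package version. The following is a minimal sketch that uses only the standard library (Python 3.8 or later); the helper name installed_version is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string, or None if the install did not succeed.
print(installed_version('pypinyin'))
```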
2. Basic use
The most common Pinyin conversion methods are as follows:
from pypinyin import pinyin, lazy_pinyin, Style

pinyin('中心')
# [['zhōng'], ['xīn']]
Return every reading of polyphonic characters:
from pypinyin import pinyin, lazy_pinyin, Style

pinyin('中心', heteronym=True)  # Enable heteronym (polyphonic) mode
# [['zhōng', 'zhòng'], ['xīn']]
Set the output style to return only the first letter of each syllable:
from pypinyin import pinyin, lazy_pinyin, Style

pinyin('中心', style=Style.FIRST_LETTER)  # Set the pinyin style
# [['z'], ['x']]
Change where the tone number appears: right after the letter that carries the tone, or at the end of each syllable:
from pypinyin import pinyin, lazy_pinyin, Style

# TONE2 puts the tone number after the letter that carries the tone
pinyin('中心', style=Style.TONE2, heteronym=True)
# [['zho1ng', 'zho4ng'], ['xi1n']]

# TONE3 puts the tone number at the end of each syllable
pinyin('中心', style=Style.TONE3, heteronym=True)
# [['zhong1', 'zhong4'], ['xin1']]
Return plain pinyin without considering polyphonic characters:
from pypinyin import pinyin, lazy_pinyin, Style

lazy_pinyin('中心')  # Ignores alternative readings
# ['zhong', 'xin']
Use ü instead of v:
from pypinyin import pinyin, lazy_pinyin, Style

lazy_pinyin('战略', v_to_u=True)  # Output ü rather than v
# ['zhan', 'lüe']
Mark the neutral tone:
from pypinyin import pinyin, lazy_pinyin, Style

# Use 5 to mark the neutral tone
lazy_pinyin('衣裳', style=Style.TONE3, neutral_tone_with_five=True)
# ['yi1', 'shang5']
Convert to pinyin from the command line in one step:
python -m pypinyin 音乐
# yīn yuè
3. Advanced use
Custom Pinyin display style
We can use register() to define a custom pinyin style:
from pypinyin import lazy_pinyin
from pypinyin.style import register

@register('kiss')
def kiss(pinyin, **kwargs):
    return '😘 {0}'.format(pinyin)

lazy_pinyin('么么', style='kiss')
# ['😘 me', '😘 me']
As you can see, defining a kiss function and decorating it with register creates a new style that can be passed directly as the style parameter of the conversion functions.
In addition, the styles built into the module and their effects are as follows:
from enum import IntEnum, unique

@unique
class Style(IntEnum):
    """Pinyin style"""

    #: Normal style, without tones. e.g. 中国 -> ``zhong guo``
    NORMAL = 0
    #: Standard tone style (default): the tone mark sits on the first letter of the final. e.g. 中国 -> ``zhōng guó``
    TONE = 1
    #: Tone style 2: the tone is a digit [1-4] after the letter that carries it. e.g. 中国 -> ``zho1ng guo2``
    TONE2 = 2
    #: Tone style 3: the tone is a digit [1-4] at the end of each syllable. e.g. 中国 -> ``zhong1 guo2``
    TONE3 = 8
    #: Initials style: only the initial of each syllable (note: some syllables have no initial, see `#27`_). e.g. 中国 -> ``zh g``
    INITIALS = 3
    #: First-letter style: only the first letter of each syllable. e.g. 中国 -> ``z g``
    FIRST_LETTER = 4
    #: Finals style: only the final of each syllable, without tones. e.g. 中国 -> ``ong uo``
    FINALS = 5
    #: Standard finals style: with the tone mark on the first letter of the final. e.g. 中国 -> ``ōng uó``
    FINALS_TONE = 6
    #: Finals style 2: the tone is a digit [1-4] after the letter that carries it. e.g. 中国 -> ``o1ng uo2``
    FINALS_TONE2 = 7
    #: Finals style 3: the tone is a digit [1-4] at the end of each final. e.g. 中国 -> ``ong1 uo2``
    FINALS_TONE3 = 9
    #: Zhuyin (Bopomofo) style, with tones; the first tone is not marked. e.g. 中国 -> ``ㄓㄨㄥ ㄍㄨㄛˊ``
    BOPOMOFO = 10
    #: Zhuyin (Bopomofo) style, first symbol only. e.g. 中国 -> ``ㄓ ㄍ``
    BOPOMOFO_FIRST = 11
    #: Cyrillic style: pinyin mapped to Russian letters, with the tone as a digit [1-4] after each syllable. e.g. 中国 -> ``чжун1 го2``
    CYRILLIC = 12
    #: Cyrillic style, first letter only. e.g. 中国 -> ``ч г``
    CYRILLIC_FIRST = 13
Handling special characters
By default, special characters in text will be returned as they are without any processing:
pinyin('你好☆☆')
# [['nǐ'], ['hǎo'], ['☆☆']]
To handle these characters differently, set the errors parameter to one of the following:
ignore: drop the characters
pinyin('你好☆☆', errors='ignore')
# [['nǐ'], ['hǎo']]
replace: replace each character with its Unicode code point in hexadecimal, without the \u prefix:
pinyin('你好☆☆', errors='replace')
# [['nǐ'], ['hǎo'], ['26062606']]
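To see where 26062606 comes from: ☆ is the code point U+2606, and the two stars are written back to back as hexadecimal. The pure-Python sketch below reproduces that string without using pypinyin; the helper hex_code_points is a made-up name, and this is not necessarily how the library implements it internally.

```python
def hex_code_points(chars):
    """Concatenate each character's Unicode code point as lowercase hex."""
    return ''.join(format(ord(ch), 'x') for ch in chars)


print(hex_code_points('☆☆'))  # 26062606
```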
callable: a callback function that receives the run of characters that have no pinyin (as a string). It may return a str, a list, or None:
pinyin('你好☆☆', errors=lambda x: 'star')
# [['nǐ'], ['hǎo'], ['star']]
pinyin('你好☆☆', errors=lambda x: None)
# [['nǐ'], ['hǎo']]
When the callback returns a list, the list is expanded automatically:
pinyin('你好☆☆', errors=lambda x: ['star' for _ in x])
# [['nǐ'], ['hǎo'], ['star'], ['star']]

# Return several readings for each character
pinyin('你好☆☆', heteronym=True, errors=lambda x: [['star', '☆'] for _ in x])
# [['nǐ'], ['hǎo'], ['star', '☆'], ['star', '☆']]
Custom Pinyin Library
If the module's output is not what you want, or you need special handling, you can use load_single_dict() or load_phrases_dict() to correct the result with a custom pinyin dictionary:
from pypinyin import lazy_pinyin, load_phrases_dict, Style, load_single_dict

hans = '桔子'
lazy_pinyin(hans, style=Style.TONE2)
# ['jie2', 'zi3']
load_phrases_dict({'桔子': [['jú'], ['zǐ']]})  # Add the phrase "桔子"
lazy_pinyin(hans, style=Style.TONE2)
# ['ju2', 'zi3']

hans = '还没'
lazy_pinyin(hans, style=Style.TONE2)
# ['hua2n', 'me2i']
load_single_dict({ord('还'): 'hái,huán'})  # Put the reading "hái" first for "还"
lazy_pinyin('还没', style=Style.TONE2)
# ['ha2i', 'me2i']