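Is there any script, library, or program, using Python or Bash tools (e.g. awk, perl, sed), that can correctly convert numbered pinyin (e.g. dian4 nao3) to UTF-8 pinyin with tone marks (e.g. diànnǎo)?
I found the following examples, but they require PHP or C#:
I have also found various online tools, but they cannot handle a large number of conversions.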
Answer 0 (score: 19)
I have some Python 3 code that does this, and it is small enough to put directly in the answer.
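import re

# Row 0 lists the plain vowels; rows 1-4 hold the same vowels carrying tones 1-4.
# "v" and "ü" both map to the marked ü forms.
PinyinToneMark = {
    0: "aoeiuv\u00fc",
    1: "\u0101\u014d\u0113\u012b\u016b\u01d6\u01d6",
    2: "\u00e1\u00f3\u00e9\u00ed\u00fa\u01d8\u01d8",
    3: "\u01ce\u01d2\u011b\u01d0\u01d4\u01da\u01da",
    4: "\u00e0\u00f2\u00e8\u00ec\u00f9\u01dc\u01dc",
}

def decode_pinyin(s):
    s = s.lower()
    r = ""  # converted output so far
    t = ""  # syllable currently being collected
    for c in s:
        # Collect letters; a literal ü is accepted too, so it is not dropped.
        if 'a' <= c <= 'z' or c == "\u00fc":
            t += c
        elif c == ':':
            # "u:" is an ASCII spelling of ü
            assert t[-1] == 'u'
            t = t[:-1] + "\u00fc"
        else:
            if '0' <= c <= '5':
                tone = int(c) % 5  # 0 and 5 both mean the neutral tone
                if tone != 0:
                    m = re.search("[aoeiuv\u00fc]+", t)
                    if m is None:
                        # No vowel to mark: keep the digit as-is
                        t += c
                    elif len(m.group(0)) == 1:
                        # A single vowel takes the mark
                        t = t[:m.start(0)] + PinyinToneMark[tone][PinyinToneMark[0].index(m.group(0))] + t[m.end(0):]
                    else:
                        # Several vowels: a, then o, then e; otherwise the last vowel of ui/iu
                        if 'a' in t:
                            t = t.replace("a", PinyinToneMark[tone][0])
                        elif 'o' in t:
                            t = t.replace("o", PinyinToneMark[tone][1])
                        elif 'e' in t:
                            t = t.replace("e", PinyinToneMark[tone][2])
                        elif t.endswith("ui"):
                            t = t.replace("i", PinyinToneMark[tone][3])
                        elif t.endswith("iu"):
                            t = t.replace("u", PinyinToneMark[tone][4])
                        else:
                            # Unrecognised vowel cluster: flag it visibly
                            t += "!"
            # Any other character (space, punctuation, tone digit) ends the syllable and is not copied
            r += t
            t = ""
    r += t
    return r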
This handles ü, u:, and v, all of which I have encountered. Python 2 compatibility would require minor modifications.
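The three spellings of ü mentioned above can also be normalized in a separate pass before any tone handling. The following is only a minimal sketch of that idea, assuming the input consists purely of numbered pinyin (pinyin has no letter v of its own, so replacing every v is safe only under that assumption). The function name `normalize_u_umlaut` is hypothetical and not part of the answer's code.

```python
def normalize_u_umlaut(text):
    """Rewrite the ASCII stand-ins 'u:' and 'v' as 'ü', preserving case.

    Assumes the text is pure numbered pinyin; any other word that
    contains a 'v' would be altered as well.
    """
    replacements = (("u:", "ü"), ("U:", "Ü"), ("v", "ü"), ("V", "Ü"))
    for ascii_form, umlaut in replacements:
        text = text.replace(ascii_form, umlaut)
    return text


print(normalize_u_umlaut("nv3 ren2, lu:4 se4"))  # prints: nü3 ren2, lü4 se4
```

Running this step first would leave the tone-marking code with a single ü spelling to deal with.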
Answer 1 (score: 5)
The cjklib library covers what you need:
Using the Python shell:
>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> print f.convert('Bei3jing1', 'Pinyin', 'Pinyin', sourceOptions={'toneMarkType': 'numbers'})
Běijīng
Or simply from the command line:
$ cjknife -m Bei3jing1
Běijīng
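Disclaimer: I developed this library.
Answer 2 (score: 5)
I wrote another Python function that does this. It is case-insensitive and preserves spaces, punctuation, and other text (unless there are false positives, of course):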
# -*- coding: utf-8 -*-
# Uses u'' literals and print() so that it runs on Python 2.7 and Python 3.3+.
import re

pinyinToneMarks = {
    u'a': u'āáǎà', u'e': u'ēéěè', u'i': u'īíǐì',
    u'o': u'ōóǒò', u'u': u'ūúǔù', u'ü': u'ǖǘǚǜ',
    u'A': u'ĀÁǍÀ', u'E': u'ĒÉĚÈ', u'I': u'ĪÍǏÌ',
    u'O': u'ŌÓǑÒ', u'U': u'ŪÚǓÙ', u'Ü': u'ǕǗǙǛ'
}

def convertPinyinCallback(m):
    # group 1: vowel cluster, group 2: optional n/g/r ending, group 3: tone digit
    tone = int(m.group(3)) % 5
    r = m.group(1).replace(u'v', u'ü').replace(u'V', u'Ü')
    # for multiple vowels, use first one if it is a/e/o, otherwise use second one
    pos = 0
    if len(r) > 1 and r[0] not in u'aeoAEO':
        pos = 1
    if tone != 0:
        r = r[0:pos] + pinyinToneMarks[r[pos]][tone - 1] + r[pos + 1:]
    return r + m.group(2)

def convertPinyin(s):
    return re.sub(u'([aeiouüvÜ]{1,3})(n?g?r?)([012345])', convertPinyinCallback, s, flags=re.IGNORECASE)

print(convertPinyin(u'Ni3 hao3 ma0?'))
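Since the question asks about converting large amounts of text, here is a Python 3 sketch of the same regex approach, arranged so that it converts input line by line rather than loading everything at once. It is an illustration under assumptions, not this answer's code: the names `TONE_MARKS`, `SYLLABLE`, `numbered_to_marked`, and `convert_stream` are made up for the example, and the vowel rule is the one described in the comment above (first vowel if it is a/e/o, otherwise the second).

```python
import io
import re
import sys

# Marked forms for tones 1-4, indexed by tone - 1 (hypothetical table name).
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
    "A": "ĀÁǍÀ", "E": "ĒÉĚÈ", "I": "ĪÍǏÌ",
    "O": "ŌÓǑÒ", "U": "ŪÚǓÙ", "Ü": "ǕǗǙǛ",
}

# Vowel cluster, optional n/g/r ending, then a tone digit.
SYLLABLE = re.compile(r"([aeiouüv]{1,3})(n?g?r?)([0-5])", re.IGNORECASE)


def _mark(match):
    vowels = match.group(1).replace("v", "ü").replace("V", "Ü")
    tone = int(match.group(3)) % 5  # 0 and 5 both mean the neutral tone
    if tone:
        # Mark the first vowel if it is a/e/o, otherwise the second one.
        pos = 1 if len(vowels) > 1 and vowels[0] not in "aeoAEO" else 0
        vowels = vowels[:pos] + TONE_MARKS[vowels[pos]][tone - 1] + vowels[pos + 1:]
    return vowels + match.group(2)


def numbered_to_marked(text):
    """Replace tone digits with tone marks; all other text is left alone."""
    return SYLLABLE.sub(_mark, text)


def convert_stream(lines, out):
    """Convert an iterable of lines one at a time and write them to out."""
    for line in lines:
        out.write(numbered_to_marked(line))


# Small demonstration on an in-memory "file".
sample = io.StringIO("dian4 nao3\nNi3 hao3 ma0?\n")
convert_stream(sample, sys.stdout)
# prints:
# diàn nǎo
# Nǐ hǎo ma?
```

To use it as a command-line filter, one could call `convert_stream(sys.stdin, sys.stdout)` instead of the demonstration lines, assuming the terminal or pipe is using UTF-8. The same false-positive caveat applies: any vowel followed directly by a digit from 0 to 5 is treated as a pinyin syllable.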
Answer 3 (score: 1)
I ported the code from dani_l to Kotlin (the Java version should be very similar). Here it is:
import java.util.regex.Pattern

val pinyinToneMarks = mapOf(
    'a' to "āáǎà",
    'e' to "ēéěè",
    'i' to "īíǐì",
    'o' to "ōóǒò",
    'u' to "ūúǔù",
    'ü' to "ǖǘǚǜ",
    'A' to "ĀÁǍÀ",
    'E' to "ĒÉĚÈ",
    'I' to "ĪÍǏÌ",
    'O' to "ŌÓǑÒ",
    'U' to "ŪÚǓÙ",
    'Ü' to "ǕǗǙǛ"
)

fun toPinyin(asciiPinyin: String): String {
    // CASE_INSENSITIVE mirrors re.IGNORECASE in the Python original, so uppercase vowels match too
    val pattern = Pattern.compile("([aeiouüvÜ]{1,3})(n?g?r?)([012345])", Pattern.CASE_INSENSITIVE)
    val matcher = pattern.matcher(asciiPinyin)
    val s = StringBuilder()
    var start = 0
    while (matcher.find(start)) {
        // copy the untouched text that precedes the vowel cluster
        s.append(asciiPinyin, start, matcher.start(1))
        val tone = Integer.parseInt(matcher.group(3)!!) % 5
        val r = matcher.group(1)!!.replace("v", "ü").replace("V", "Ü")
        // for multiple vowels, use first one if it is a/e/o, otherwise use second one
        val pos = if (r.length > 1 && r[0].toString() !in "aeoAEO") 1 else 0
        if (tone != 0) s.append(r, 0, pos).append(pinyinToneMarks[r[pos]]!![tone - 1]).append(r, pos + 1, r.length)
        else s.append(r)
        s.append(matcher.group(2))
        start = matcher.end(3)
    }
    // copy whatever follows the last match
    if (start != asciiPinyin.length) s.append(asciiPinyin, start, asciiPinyin.length)
    return s.toString()
}

fun test() = print(toPinyin("Ni3 hao3 ma0?"))
Answer 4 (score: 1)
I ported @Lakedaemon's Kotlin code to Java.
// Both methods belong inside a class and need these imports:
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// auxiliary function: the character at the given position, as a String
static public String getCharacter(String string, int position) {
    char[] characters = string.toCharArray();
    return String.valueOf(characters[position]);
}

static public String toPinyin(String asciiPinyin) {
    Map<String, String> pinyinToneMarks = new HashMap<String, String>();
    pinyinToneMarks.put("a", "āáǎà"); pinyinToneMarks.put("e", "ēéěè");
    pinyinToneMarks.put("i", "īíǐì"); pinyinToneMarks.put("o", "ōóǒò");
    pinyinToneMarks.put("u", "ūúǔù"); pinyinToneMarks.put("ü", "ǖǘǚǜ");
    pinyinToneMarks.put("A", "ĀÁǍÀ"); pinyinToneMarks.put("E", "ĒÉĚÈ");
    pinyinToneMarks.put("I", "ĪÍǏÌ"); pinyinToneMarks.put("O", "ŌÓǑÒ");
    pinyinToneMarks.put("U", "ŪÚǓÙ"); pinyinToneMarks.put("Ü", "ǕǗǙǛ");
    // CASE_INSENSITIVE mirrors re.IGNORECASE in the Python original
    Pattern pattern = Pattern.compile("([aeiouüvÜ]{1,3})(n?g?r?)([012345])", Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(asciiPinyin);
    StringBuilder s = new StringBuilder();
    int start = 0;
    while (matcher.find(start)) {
        // copy the untouched text that precedes the vowel cluster
        s.append(asciiPinyin, start, matcher.start(1));
        int tone = Integer.parseInt(matcher.group(3)) % 5;
        String r = matcher.group(1).replace("v", "ü").replace("V", "Ü");
        // for multiple vowels, use first one if it is a/e/o, otherwise use second one
        // (the test is negated, matching the Kotlin "!in" check)
        int pos = r.length() > 1 && !"aeoAEO".contains(getCharacter(r, 0)) ? 1 : 0;
        if (tone != 0) {
            s.append(r, 0, pos).append(getCharacter(pinyinToneMarks.get(getCharacter(r, pos)), tone - 1)).append(r, pos + 1, r.length());
        } else {
            s.append(r);
        }
        s.append(matcher.group(2));
        start = matcher.end(3);
    }
    // copy whatever follows the last match
    if (start != asciiPinyin.length()) {
        s.append(asciiPinyin, start, asciiPinyin.length());
    }
    return s.toString();
}
Answer 5 (score: 0)
Using the Python package dragonmapper (pip install dragonmapper):
Hanzi to pinyin
from dragonmapper import hanzi
hanzi.to_pinyin("过河拆桥。")
# >>> 'guòhéchāiqiáo。'
hanzi.to_pinyin("过河拆桥。", accented=False)
# >>> 'guo4he2chai1qiao2。'
Accented pinyin to numbered pinyin
from dragonmapper.transcriptions import accented_to_numbered
accented_to_numbered('guò hé chāi qiáo。')
# >>> 'guo4 he2 chai1 qiao2。'
Numbered pinyin to accented pinyin
from dragonmapper.transcriptions import numbered_to_accented
numbered_to_accented('guo4 he2 chai1 qiao2。')
# >>> 'guò hé chāi qiáo。'
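Disclaimer: I am not affiliated with the Dragonmapper author.
Answer 6 (score: -1)
I came across a VBA macro that does this in Microsoft Word, at pinyinjoe.com
I reported a small flaw, and he replied that he would incorporate my suggestion "as soon as I can". That was in early January 2014; I have had no incentive to check since, because the fix is already in my own copy.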