Question

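I've been working with CC-CEDICT, a free downloadable Chinese-English dictionary, and using Python to make some small changes and reformat it. When I run code that only reorganizes the dictionary into a CSV file, I have no problems, and the characters are written to the file as expected. Here is that code: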

filename = 'cedict_ts.u8.txt'
newname = 'cedict_ts.u8.csv'

f = open(filename,'r')
allLines = f.readlines()
f.close()

newf = open(newname, 'w')
endofhash = False
for i in range(0, len(allLines)):
    curLine = allLines[i]
    if curLine[0] == '#':
        newf.write(curLine)
    else:
        if(not endofhash):
            newarr = ['Traditional','Simplified','Pinyin','Definition(s)\r\n']
            newline = ','.join(newarr)
            newf.write(newline)
            endofhash = True

        firstws = curLine.find(' ')
        lsbrack = curLine.find('[')
        rsbrack = curLine.find(']')
        fslash = curLine.find('/')
        lslash = curLine.rfind('/')
        trad = curLine[0:firstws]
        simp = curLine[firstws+1:lsbrack-1]
        piny = curLine[lsbrack+1:rsbrack]
        defin = curLine[fslash+1:lslash]
        defin = defin.replace('/','; ')
        defin = defin + '\r\n'
        newarr = [trad, simp, piny, defin]
        newline = ','.join(newarr)
        newf.write(newline)

newf.close()

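However, when I run a program that also converts the pinyin to a different notation and adds it to the dictionary, the contents of the output file are gobbledygook. As a test, I had the program print each line before writing it to the file, and the lines print to the terminal as expected. Here is the code that does this: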

from pinyinConverter import *

filename = 'cedict_ts.u8.txt'
newname = 'cedict_ts_wpym.u8.csv'

f = open(filename,'r')
allLines = f.readlines()
f.close()

apy = readPinyinTextfile('pinyinchars.txt')

newf = open(newname, 'w')
endofhash = False
for i in range(0, len(allLines)):
    curLine = allLines[i]
    if curLine[0] == '#':
        newf.write(curLine)
    else:
        if(not endofhash):
            newarr = ['Traditional','Simplified','Pinyin','PinyinWithMarks','Definition(s)\r\n']
            newline = ','.join(newarr)
            newf.write(newline)
            endofhash = True

        firstws = curLine.find(' ')
        lsbrack = curLine.find('[')
        rsbrack = curLine.find(']')
        fslash = curLine.find('/')
        lslash = curLine.rfind('/')
        trad = curLine[0:firstws]
        simp = curLine[firstws+1:lsbrack-1]
        piny = curLine[lsbrack+1:rsbrack]
        split_piny = piny.split(' ')
        for i in range(0, len(split_piny)):
            curPin = split_piny[i]
            newPin = convertPinyinSystem(curPin, apy)
            split_piny[i] = newPin
        pnwm = ' '.join(split_piny)
        defin = curLine[fslash+1:lslash]
        defin = defin.replace('/','; ')
        defin = defin + '\r\n'
        newarr = [trad, simp, piny, pnwm, defin]
        newline = ','.join(newarr)
        newf.write(newline)

newf.close()

Here is the code in the pinyinConverter file:

def convertPinyinSystem(inputString, allPinyin):

    chars = ['a','e', 'i', 'o','u','u:']

    tone = grabTone(inputString)
    toneIdx = (tone - 1) * 2
    hasIdx = -1
    for i in range(0, len(chars)):
        if(chars[i] in inputString):
            hasIdx = i
    newString = inputString
    newString = newString.replace(str(tone),'')
    if(not ('iu' in inputString)):
        newChar = allPinyin[hasIdx][toneIdx:toneIdx+2]
    else:
        newChar = allPinyin[4][toneIdx:toneIdx+2]

    newString = newString.replace(chars[hasIdx],newChar)
    if(tone == 5):
        newString = inputString
        newString = newString.replace(str(tone),'')
        return newString
    elif(tone == -1):
        return inputString
    else:
        return newString




def readPinyinTextfile(pinyintextfile):
    f = open(pinyintextfile, 'r')
    allLines = f.readlines()
    f.close()
    for i in range(0, len(allLines)):
        curLine = allLines[i]
        curLine = curLine[0:len(curLine)-1]
        allLines[i] = curLine

    return allLines



def grabTone(inputText):

    isToneIdx = False
    idx = 0
    while(not isToneIdx):
        isToneIdx = is_int(inputText[idx])
        if(isToneIdx):
            break
        else:
            idx += 1
            if(idx == len(inputText)):
                return -1

    return int(inputText[idx])


def is_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

The contents of pinyinchars.txt are:

āáăà
ēéĕè
īíĭì
ōóŏò
ūúŭù
ǖǘǚǜ

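I'm using a 2009 MacBook Pro running OS X 10.8.5, the Python version is 2.7.6, and the dictionary is encoded in UTF-8. I also know that some of the pinyin conversion code isn't optimized, but that shouldn't matter here.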

Answer 1

If your pinyin file is encoded in UTF-8, you might want to try the codecs module, which is part of the standard library, like this:

import codecs

...

def readPinyinTextfile(pinyintextfile):
    f = codecs.open(pinyintextfile, 'r', 'utf-8')
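
One thing to watch for with this change (this is my reading of the question's code, not something I have run against your files): once the file is decoded, each line of pinyinchars.txt is a string of four characters rather than eight bytes, so a byte-based slice such as toneIdx:toneIdx+2 in convertPinyinSystem would pick up two characters instead of one. A minimal sketch of the difference, where pick_marked_vowel is a hypothetical helper and the row is written with escapes instead of being read from the file:

```python
# First row of pinyinchars.txt (a with macron, acute, breve, grave),
# once as decoded text and once as raw UTF-8 bytes.
row_text = u"\u0101\u00e1\u0103\u00e0"
row_bytes = row_text.encode("utf-8")


def pick_marked_vowel(row, tone):
    """Hypothetical helper: return the vowel for tones 1-4 from a decoded row."""
    return row[tone - 1]


# Each of these vowels takes two bytes in UTF-8, so the two lengths differ.
print(len(row_text))
print(len(row_bytes))

# Slicing with (tone - 1) * 2 only lines up on the undecoded bytes.
tone = 3
tone_idx = (tone - 1) * 2
from_bytes = row_bytes[tone_idx:tone_idx + 2].decode("utf-8")

# On decoded text the same vowel is a single index.
from_text = pick_marked_vowel(row_text, tone)
print(from_bytes == from_text)
```

So if you switch the reader to codecs.open, the lookup in convertPinyinSystem probably needs to index by tone - 1 rather than slice two positions.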

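If the output looks fine in the terminal, then you probably need to change the writing side specifically to use the codecs module as well: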

apy = readPinyinTextfile('pinyinchars.txt')

newf = codecs.open(newname, 'w', 'utf-8')
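
In Python 2, joining decoded text with undecoded byte strings that contain non-ASCII bytes raises UnicodeDecodeError, so the dictionary file itself likely needs to be read the same way as the pinyin file. Here is a sketch of decoding on read and encoding on write so that everything in between is text. It uses io.open, which takes an encoding argument in both Python 2.7 and Python 3. The names read_lines and write_lines, the temporary file, and the sample row are illustrative and not from the question.

```python
import io
import os
import tempfile


def read_lines(path):
    """Read a UTF-8 file and return decoded lines without their line endings."""
    with io.open(path, "r", encoding="utf-8") as f:
        return [line.rstrip(u"\r\n") for line in f]


def write_lines(path, lines):
    """Write text lines as UTF-8, ending each one with CRLF as in the question."""
    # newline="" turns off newline translation so "\r\n" is written unchanged.
    with io.open(path, "w", encoding="utf-8", newline="") as f:
        for line in lines:
            f.write(line + u"\r\n")


# Round-trip demo with a single made-up CSV row (escapes keep the source ASCII).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.csv")
row = u",".join([
    u"\u4e2d\u570b",         # traditional
    u"\u4e2d\u56fd",         # simplified
    u"Zhong1 guo2",          # pinyin with tone numbers
    u"Zh\u014dng gu\u00f3",  # pinyin with tone marks
    u"China",
])
write_lines(demo_path, [row])
round_trip = read_lines(demo_path)
print(round_trip == [row])
```

Note that rstrip removes only line endings. Slicing off the last character unconditionally, as readPinyinTextfile does, would cut into the final vowel if the last line of pinyinchars.txt has no trailing newline.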
