Question

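I am trying to remove all non-ASCII characters from a text document. I found a package that should do just that: https://pypi.python.org/pypi/Unidecode

It should accept a string and convert every non-ASCII character to the closest ASCII character available. I used this same module in Perl easily enough by just calling while (<input>) { $_ = unidecode($_); }. The Python module is a direct port of the Perl one, and the documentation indicates it should work the same.

I'm sure this is something simple; I just don't understand enough about character and file encodings to see what the problem is. My origfile is encoded in UTF-8 (converted from UCS-2LE). The problem probably has more to do with my lack of encoding knowledge and with handling strings wrongly than with the module, and I hope someone can explain why. I have tried everything I know short of randomly inserting code, and searching for the errors I'm getting has turned up nothing so far.

Here is my Python: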

from unidecode import unidecode

def toascii():
    origfile = open(r'C:\log.convert', 'rb')
    convertfile = open(r'C:\log.toascii', 'wb')

    for line in origfile:
        line = unidecode(line)
        convertfile.write(line)

    origfile.close()
    convertfile.close()

toascii();

If I don't open the original file in byte mode (origfile = open('file.txt','r')), I get the error UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1563: character maps to <undefined> from the line for line in origfile:.

If I open it in byte mode 'rb', I get TypeError: ord() expected string length 1, but int found from the line line = unidecode(line).

If I declare line as a string, line = unidecode(str(line)), then it does write to the file, but... not correctly: \r\n'b'\xef\xbb\xbf[ 2013.10.05 16:18:01 ] User_Name > .\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\ It is writing out the \n, \r\n, etc. and the Unicode characters literally instead of converting them to anything.

If I convert the line to a string as above and open the convert file in byte mode 'wb', I get the error TypeError: 'str' does not support the buffer interface

If I open it in byte mode without declaring the line a string, i.e. 'wb' and unidecode(line), then I get the TypeError: ord() expected string length 1, but int found error again.

Answer 1

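The unidecode module accepts Unicode string values and returns a Unicode string in Python 3. You are giving it binary data instead. Either decode to Unicode or open the input text file in text mode, and then either encode the result to ASCII before writing it to a file or open the output text file in text mode.

Quoting from the module documentation:

The module exports a single function that takes a Unicode object (Python 2.x) or string (Python 3.x) and returns a string (that can be encoded to ASCII bytes in Python 3.x)

Emphasis mine.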
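To see why the byte-mode attempts in the question fail, here is a small sketch using only the standard library. The sample bytes are made up for illustration and are not taken from the asker's log. In Python 3, indexing or iterating over a bytes object yields integers, which is presumably what trips the ord() call inside unidecode, and str() on bytes returns its repr rather than decoding it:

```python
# Hypothetical sample: a UTF-8 encoded line containing U+2550 (a box-drawing character).
raw = "User > \u2550\r\n".encode("utf8")

# Indexing (or iterating over) bytes yields ints, not 1-character strings.
first = raw[0]
print(type(first))

# str() on bytes gives the repr, including the b'' wrapper and the \x escapes.
as_repr = str(raw)
print(as_repr)

# Decoding is what actually turns the bytes into text.
text = raw.decode("utf8")
print(text == "User > \u2550\r\n")
```

This is why str(line) "works" but writes b'...' and \xe2\x95\x90 sequences into the output: the text being transliterated is the repr of the bytes, which is already pure ASCII.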

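This should work:

from unidecode import unidecode

def toascii():
    # Read as UTF-8 text, write as ASCII text; both files are closed automatically.
    with open(r'C:\log.convert', 'r', encoding='utf8') as origfile, \
         open(r'C:\log.toascii', 'w', encoding='ascii') as convertfile:
        for line in origfile:
            line = unidecode(line)
            convertfile.write(line)

This opens the input file in text mode (using UTF-8 encoding, which, judging by your sample line, is correct) and writes in text mode (encoding to ASCII).

You do need to specify the encoding of the file you are opening explicitly. If you omit the encoding, the current system locale is used (the result of a locale.getpreferredencoding(False) call), which usually won't be the correct codec if your code needs to be portable.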
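As a quick check of the point about defaults, the sketch below (standard library only; the temporary file and its contents are made up for illustration) prints the locale's preferred encoding and then reads a UTF-8 file that starts with a byte order mark in two ways. The 'utf-8-sig' codec is an addition that the answer does not mention: it drops the leading \xef\xbb\xbf visible in the question's sample output, whereas plain 'utf8' keeps it in the text as U+FEFF.

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Hypothetical input: a UTF-8 file that starts with a byte order mark.
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbf[ 2013.10.05 ] hello\r\n")
    path = tmp.name

try:
    with open(path, "r", encoding="utf8") as f:
        with_bom = f.read()       # the BOM survives as '\ufeff'
    with open(path, "r", encoding="utf-8-sig") as f:
        without_bom = f.read()    # the BOM is stripped
finally:
    os.remove(path)

print(repr(with_bom))
print(repr(without_bom))
```

On a Windows machine the first print will often show a code page such as cp1252, which would explain the 'charmap' error in the question. Whether to use 'utf8' or 'utf-8-sig' for the input depends on whether the BOM should reach unidecode at all.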

How to use unidecode in Python (3.3)
