Question

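I am trying to convert a localization file that contains Chinese characters, so that the Chinese characters are converted to latin1 encoding.

However, when I run the Python script, I get this error...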

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb9 in position 0: ordinal not in range(128)

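Here is my Python script. It essentially just takes user input for the file to convert, then converts the file (all lines that start with [ or are empty should be skipped)... The part that needs to be converted is always at index 1 in the list.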

# coding: utf8

# Enter File Name
file_name = raw_input('Enter File Path/Name To Convert: ')

# Open the File we Write too...
write_file = open(file_name + "_temp", 'w+')

# Open the File we Read From...
read_file = open(file_name)

with open(file_name) as file_to_write:
    for line in file_to_write:
        # We ignore any line that starts with [] or is empty...
        if line and line[0:1] != '[':
            split_string = line.split("=")
            if len(split_string) == 2:
                write_file.write(split_string[0] + "=" + split_string[1].encode('gbk').decode('latin1') + "\n")
            else:
                write_file.write(line)
        else:
            write_file.write(line)



# Close File we Write too..
write_file.close()

# Close File we read too..
read_file.close()

An example config file is...

[Example]
Password=密碼

The output should be converted to...

[Example]
Password=±K½X

Answer 1

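The latin1 encoding cannot represent Chinese characters. If latin1 is all you can have for output, the best you can get is escape sequences.

You are using Python 2.x. Python 3.x opens files as text and automatically decodes the bytes it reads into (unicode) strings.

In Python 2, when you read a file you get bytes. You are responsible for decoding those bytes to text (unicode objects in Python 2.x), processing them, and re-encoding them to the desired encoding when you write the information to another file. Calling .encode() directly on a byte string makes Python 2 first decode it implicitly with the ASCII codec, which is what raises the UnicodeDecodeError you are seeing.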
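A minimal Python 3 sketch of that decode, process, re-encode cycle, working on raw bytes (which behave much like Python 2 `str` objects). The helper name `reencode_line` and the choice of GBK as source and UTF-8 as target are assumptions made for illustration, not part of the original script.

```python
def reencode_line(raw_line, source_encoding="gbk", target_encoding="utf-8"):
    """Decode raw bytes to text, then encode the text in the target encoding."""
    text = raw_line.decode(source_encoding)   # bytes -> unicode text
    # ... any text processing would happen here, on `text` ...
    return text.encode(target_encoding)       # unicode text -> bytes


# Bytes as they would come out of a GBK-encoded file.
gbk_bytes = "Password=密碼\n".encode("gbk")
utf8_bytes = reencode_line(gbk_bytes)
print(repr(utf8_bytes))
```

The order matters: bytes are decoded first and only text is encoded, which is the reverse of the `.encode('gbk').decode('latin1')` chain in the question.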

So, the line:

write_file.write(split_string[0] + "=" + split_string[1].encode('gbk').decode('latin1') + "\n")

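should instead be:

write_file.write(split_string[0] + "=" + split_string[1].decode('gbk').encode('latin1', errors="backslashreplace") + "\n")

Note that I added the errors="backslashreplace" argument to the encode call. As I said above, latin1 is a single-byte character set with only 256 code points, 191 of them printable: it contains the Latin letters and the most commonly used accented characters ("áéíóúçã" and so on), plus some punctuation and math symbols, but no characters for other scripts. With this handler, anything latin1 cannot represent is written as a \uXXXX escape sequence instead of raising an error.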
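To illustrate the `errors` argument, here is a small Python 3 sketch comparing a few of the error handlers built into `str.encode`. The sample string is taken from the question; the variable names are illustrative only.

```python
text = "密碼"

# The default handler ("strict") refuses characters latin1 cannot represent.
try:
    text.encode("latin1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "backslashreplace" emits Python-style \uXXXX escapes instead of failing.
escaped = text.encode("latin1", errors="backslashreplace")

# "xmlcharrefreplace" emits XML/HTML numeric character references.
xml_refs = text.encode("latin1", errors="xmlcharrefreplace")

# "replace" substitutes a question mark for each unencodable character.
replaced = text.encode("latin1", errors="replace")

print(strict_failed, escaped, xml_refs, replaced)
```

None of these handlers puts the Chinese text into latin1. They either record how to reconstruct the character or only mark that something unrepresentable was there.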

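If you must represent these characters as text, you should use the utf-8 encoding, and configure whatever software consumes the generated file to read it with that encoding.

That said, what you are doing is a horrible practice. Unless you are opening a true nightmare of a file that is known to contain text in several different encodings, you should decode all of the text to unicode and re-encode all of it, not just the part of the data that is expected to hold non-ASCII characters. Do not do this, though, if the original file contains other bytes that are not valid GBK. Otherwise, your inner loop could simply be: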

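# Python 2: lines read from the file are byte strings
with open(file_name) as read_file, open(file_name + "_temp", "wt") as write_file:
    for line in read_file:
        # Decode the whole line from GBK, then re-encode it as UTF-8
        write_file.write(line.decode("gbk").encode("utf-8"))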
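
For readers on Python 3, the same whole-file conversion might look like the sketch below. It assumes the input really is GBK-encoded throughout; the function name `convert_file` and its parameters are illustrative and do not come from the original script.

```python
def convert_file(file_name, source_encoding="gbk", target_encoding="utf-8"):
    """Copy file_name to file_name + '_temp', re-encoding every line."""
    out_name = file_name + "_temp"
    # In Python 3, open() decodes on read and encodes on write.
    # newline="" passes line endings through unchanged.
    with open(file_name, encoding=source_encoding, newline="") as read_file, \
            open(out_name, "w", encoding=target_encoding, newline="") as write_file:
        for line in read_file:
            write_file.write(line)
    return out_name
```

Calling `convert_file("strings.ini")` (a hypothetical file name) would write `strings.ini_temp` next to the input. Because the encodings are given to `open()`, no explicit `.decode()` or `.encode()` calls are needed.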

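As for your "example output": that is just the very same file, i.e. the same bytes as in the first file. The program that displays the line "Password=密碼" is "seeing" the file as GBK-encoded, while the other program "sees" exactly the same bytes but interprets them as latin1. You should not have to convert from one to the other.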
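
A short Python 3 sketch of that point, following the answer's assumption that the file is GBK-encoded: one byte sequence is decoded two ways, and nothing about the bytes themselves changes. The variable names are illustrative.

```python
text = "Password=密碼"

# One sequence of bytes on disk...
raw_bytes = text.encode("gbk")

# ...read by two programs that assume different encodings.
seen_as_gbk = raw_bytes.decode("gbk")        # the intended Chinese text
seen_as_latin1 = raw_bytes.decode("latin1")  # the same bytes shown as mojibake

# Re-encoding the latin1 view gives back the identical bytes,
# so no conversion of the file contents is involved.
assert seen_as_latin1.encode("latin1") == raw_bytes
print(raw_bytes)
```

Decoding as latin1 never fails, because every byte value maps to some character, which is why the second program shows garbage rather than an error.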

Character encoding of Chinese characters to Latin-1 in Python
