Question

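I have some old data stored in ASCII. Apparently some UTF-8 data was not properly converted to ASCII before it was written. For example, José shows up in the file as JosÃ©. I can fix this easily with the following Java snippet: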

byte[] utf8Bytes = c_TOBETRANSLATED.getBytes("ISO-8859-1");
String s2 = new String(utf8Bytes,"UTF-8");

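But I need to do this in Python, along with the rest of my code. I am only just getting started with Python, and my internet searches and trial and error have not helped me find a Python solution that does the same thing.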

Answer 1

If you are using Python 3, you can use the bytes function to do the following:

test = "JosÃ©"
fixed = bytes(test, 'iso-8859-1').decode('utf-8')
# fixed will now contain the string José
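
If the old data mixes damaged and undamaged strings, the one-liner above will raise an exception on text that was never garbled. Below is a minimal sketch of a wrapper that leaves such strings alone. The name fix_mojibake and its wrong_encoding parameter are made up for illustration, not part of any library, and the fallback behaviour is an assumption about what you want.

```python
def fix_mojibake(text, wrong_encoding="iso-8859-1"):
    """Undo UTF-8 bytes that were mistakenly decoded as `wrong_encoding`.

    Hypothetical helper: returns the input unchanged when the round trip
    fails, which usually means the text was not garbled this way.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(fix_mojibake("Jos\u00c3\u00a9"))  # garbled input "JosÃ©"
print(fix_mojibake("Jos\u00e9"))        # already-correct input "José"
```

Both calls should print José: the first is repaired, and the second fails the UTF-8 decode step and is returned as-is.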

Answer 2

import locale

# Correctly written
with open('file.txt','w',encoding='utf8') as f:
    f.write('José')

# The default encoding for open()
print(locale.getpreferredencoding(False))

# Incorrectly opened
with open('file.txt') as f:
    data = f.read()
    print(data)
    # What I think you are requesting as a fix.
    # Re-encode with the incorrect encoding, then decode correctly.
    print(data.encode('cp1252').decode('utf8'))

# Correctly opened
with open('file.txt',encoding='utf8') as f:
    print(f.read())

Output:

cp1252
JosÃ©
José
José
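
Note that cp1252 and iso-8859-1 are not interchangeable for this repair. The two differ in the byte range 0x80 to 0x9F, so text garbled through one cannot always be re-encoded with the other. The sketch below tries a list of candidate encodings in turn. The function name repair and the order of the candidates are assumptions for illustration; which encoding actually damaged your data depends on the system that wrote it.

```python
def repair(text, candidates=("cp1252", "iso-8859-1")):
    """Try each candidate 'wrong' encoding until the round trip succeeds.

    Hypothetical helper; returns the text unchanged if no candidate works.
    """
    for wrong_encoding in candidates:
        try:
            return text.encode(wrong_encoding).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text


# The euro sign is b'\xe2\x82\xac' in UTF-8. Read as cp1252, byte 0x82
# becomes U+201A, a character that iso-8859-1 cannot encode.
garbled_euro = "\u00e2\u201a\u00ac"
print(repair(garbled_euro))
print(repair("Jos\u00c3\u00a9"))
```

With cp1252 tried first, this should recover both the euro sign and José. Passing only "iso-8859-1" as a candidate would leave the euro example untouched, because the re-encode step fails.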

How to correct UTF-8 characters stored as ASCII
