Question

My code looks like this:

import codecs
import glob
import os

for file in glob.iglob(os.path.join(dir, '*.txt')):
    print(file)
    with codecs.open(file,encoding='latin-1') as f:
        infile = f.read()

with codecs.open('test.txt',mode='w',encoding='utf-8') as f:
    f.write(infile)

The files I am working with are encoded in Latin-1 (I cannot open them as UTF-8), but I want to write the resulting files in UTF-8.

But instead of this:

<Trans audio_filename="VALE_M11_070.MP3" xml:lang="español">
<Datos clave_texto=" VALE_M11_070" tipo_texto="entrevista_semidirigida">
<Corpus corpus="PRESEEA" subcorpus="ESESUMA" ciudad="Valencia" pais="España"/>

I get this (in gedit):

<Trans audio_filename="VALE_M11_070.MP3" xml:lang="espa뇃漀氀∀㸀ഀ਀㰀䐀愀琀`漀猀 挀氀愀瘀攀开琀攀砀琀漀㴀∀ 嘀䄀䰀䔀开䴀㄀㄀开　㜀

If I print it in the terminal, it displays fine.

Even more confusing is what I get when I open the resulting file with LibreOffice Writer:

<#T#r#a#n#s# (and so on)

So how do I correctly convert a Latin-1 string to a UTF-8 string? In Python 2 this was easy, but in Python 3 it confuses me.

I have already tried different combinations:

#infile = bytes(infile,'utf-8').decode('utf-8')
#infile = infile.encode('utf-8').decode('utf-8')
#infile = bytes(infile,'utf-8').decode('utf-8')

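But somehow I always end up with the same strange output.

Thanks in advance!

Edit: This question is different from the one linked in the comments because it concerns Python 3, not Python 2.7.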
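
A diagnostic aside: the `<#T#r#a#n#s#` pattern shown by LibreOffice would be consistent with NUL bytes sitting between the characters, which usually points to UTF-16 input rather than Latin-1. That is an assumption, not something confirmed in this post. A minimal sketch for checking the raw bytes before picking an encoding (the helper name `guess_encoding_hint` and its return labels are hypothetical):

```python
import codecs


def guess_encoding_hint(raw: bytes) -> str:
    """Return a rough, heuristic hint about how `raw` is encoded."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # NUL bytes in ordinary text usually mean a 16-bit encoding without a BOM.
    if b"\x00" in raw:
        return "utf-16-without-bom?"
    # Latin-1 decodes any byte sequence, so it can never be ruled out here.
    return "8-bit"


# Example: a UTF-16 sample with a BOM, as many Windows tools would write it.
sample = '<Trans xml:lang="español">'.encode("utf-16")
print(guess_encoding_hint(sample))
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the helper. Because Latin-1 maps every byte to a character, decoding with it never raises an error, even when the file is actually in another encoding.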

Answer 1

I found half a solution. It is not what you want/need, but it may point others in the right direction...

# First read the file, decoding it as Latin-1
with open("file_name", "r", encoding="latin-1") as txt:  # r = read, w = write, a = append
    items = txt.readlines()

# Then write the changes back to the file as UTF-8
with open("file_name", "w", encoding="utf-8") as output:
    for string_fin in items:
        # Replace each mojibake sequence with the character it stands for
        string_fin = string_fin.replace("Ã©", "é")
        string_fin = string_fin.replace("Ã«", "ë")

        # This works if not too much needs changing...

        output.write(string_fin)
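
If the input really is Latin-1, no per-character replacement should be needed: decoding with `latin-1` on read and encoding with `utf-8` on write does the whole conversion. The following is a minimal sketch under that assumption. The function name `convert_dir` and its parameters are hypothetical. It writes one output file per input file, whereas the code in the question writes `test.txt` once, after the loop.

```python
import glob
import os


def convert_dir(src_dir, dst_dir, src_encoding="latin-1"):
    """Re-encode every *.txt file in src_dir as UTF-8 inside dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    for path in sorted(glob.glob(os.path.join(src_dir, "*.txt"))):
        # newline="" keeps the original line endings untouched.
        with open(path, "r", encoding=src_encoding, newline="") as f:
            text = f.read()
        out_path = os.path.join(dst_dir, os.path.basename(path))
        with open(out_path, "w", encoding="utf-8", newline="") as f:
            f.write(text)
        written.append(out_path)
    return written
```

A call might look like `convert_dir("transcripts", "transcripts_utf8")`, where both directory names are placeholders. If the output still looks wrong, the source files are probably not Latin-1, and `src_encoding` would need to be changed accordingly.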


Answer 2

For Python 3.6:

your_str = your_str.encode('utf-8').decode('latin-1')
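
To see what this kind of encode/decode round trip does, it helps to run both directions on a short string. The answer does not say which direction it intends, so the sketch below only illustrates the behaviour. The variable names are made up for the example.

```python
original = "España"

# Encoding as UTF-8 and decoding as Latin-1 turns each non-ASCII
# character into two Latin-1 characters (mojibake).
mojibake = original.encode("utf-8").decode("latin-1")
print(mojibake)  # EspaÃ±a

# The reverse direction repairs text that was mis-decoded this way.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)  # España
```

In other words, `.encode('utf-8').decode('latin-1')` produces sequences such as `Ã±`, while `.encode('latin-1').decode('utf-8')` undoes them. Neither is needed when the file is opened with the correct `encoding` argument in the first place.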

Python 3: Converting Latin-1 to UTF-8
