Question

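I have a bunch of text files containing Korean characters with the wrong encoding. Specifically, it seems the characters were encoded with EUC-KR, but the files themselves were saved as UTF8 + BOM.

So far I have managed to fix a file as follows:

1. Open the file with EditPlus (it shows the file's encoding as UTF8+BOM)
2. In EditPlus, save the file as ANSI

Finally, in Python: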

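import codecs

# Read the file that EditPlus re-saved as ANSI, decoding its bytes as EUC-KR.
with codecs.open(html, 'rb', encoding='euc-kr') as source_file:
    contents = source_file.read()

# Write the decoded text back to the same path as UTF-8.
with open(html, 'w+b') as dest_file:
    dest_file.write(contents.encode('utf-8'))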

I want to automate this, but I have not been able to. I can open the original file in Python:

codecs.open(html, 'rb', encoding='utf-8-sig')

However, I haven't been able to figure out how to do step 2 (the "save as ANSI" part).

Answer 1

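I'm assuming here that you have text that was already encoded to EUC-KR and was then encoded again to UTF-8. If so, encoding to Latin-1 (which is what Windows loosely calls ANSI; strictly, on Western systems ANSI is code page 1252, which matches Latin-1 for bytes 0xA0 to 0xFF) is indeed the best way to get back to the original EUC-KR bytestring.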
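To see why the Latin-1 round trip works, here is a minimal sketch that reproduces the suspected corruption and then undoes it. The helper names `corrupt` and `repair` and the sample word are illustrative assumptions, not something taken from the original files.

```python
# -*- coding: utf-8 -*-

def corrupt(text):
    """Simulate the assumed damage: EUC-KR bytes misread as Latin-1,
    then saved as UTF-8 with a BOM."""
    euc_kr_bytes = text.encode('euc-kr')
    # Latin-1 maps every byte 0x00..0xFF to the code point with the same
    # number, so this step loses no information.
    misread = euc_kr_bytes.decode('latin1')
    return misread.encode('utf-8-sig')


def repair(raw):
    """Undo the damage: strip the BOM, recover the bytes, decode as EUC-KR."""
    return raw.decode('utf-8-sig').encode('latin1').decode('euc-kr')


sample = u'미술'  # hypothetical sample text
damaged = corrupt(sample)
assert repair(damaged) == sample
```

Because Latin-1 is a one-to-one mapping between bytes and the first 256 code points, encoding the mojibake back to Latin-1 yields exactly the bytes that were originally written.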

Open the file as UTF-8 with BOM, encode to Latin-1, then decode as EUC-KR:

import io

with io.open(html, encoding='utf-8-sig') as infh:
    data = infh.read().encode('latin1').decode('euc-kr')

with io.open(html, 'w', encoding='utf8') as outfh:
    outfh.write(data)

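I'm using the io.open() function here instead of codecs because it is the more robust option; io is the new Python 3 I/O library, and it has also been backported to Python 2.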
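Since the question mentions a whole batch of files, the same conversion could be wrapped in a loop. The following is only a sketch: the names `fix_file` and `fix_all`, the `*.html` pattern, and the rule of skipping files that do not round-trip are all assumptions. A file whose characters happen to fall entirely in the Latin-1 range and form valid EUC-KR would still be rewritten, so it is worth running this on a backup copy first.

```python
import glob
import io
import os


def fix_file(path):
    """Repair one file in place; return True if rewritten, False if skipped."""
    try:
        # newline='' keeps the original line endings untouched.
        with io.open(path, encoding='utf-8-sig', newline='') as infh:
            text = infh.read()
        fixed = text.encode('latin1').decode('euc-kr')
    except UnicodeError:
        # Not the expected kind of damage (for example, the file already
        # holds real Hangul), so leave it alone.
        return False
    with io.open(path, 'w', encoding='utf8', newline='') as outfh:
        outfh.write(fixed)
    return True


def fix_all(directory, pattern='*.html'):
    """Apply fix_file to every matching file; map each path to its result."""
    results = {}
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        results[path] = fix_file(path)
    return results
```

Calling `fix_all('some_directory')` (a hypothetical path) returns a dictionary showing which files were rewritten and which were skipped.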

Demo (a Python 2 session):

>>> broken = '\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
>>> print broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
미술
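The session above uses Python 2 syntax. A Python 3 equivalent, as a sketch, needs a `b''` bytes literal and the `print()` function; the variable name `fixed` is introduced here only for illustration.

```python
# Same damaged byte sequence as in the Python 2 session, as a bytes literal.
broken = b'\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'

# Strip the BOM and decode UTF-8, map code points back to bytes via Latin-1,
# then decode those bytes as EUC-KR.
fixed = broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
print(fixed)
```

This should print the same two Hangul characters as the Python 2 session.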

Fixing corrupted encoding (with Python)
