Question

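I'm trying to replace some text ("id" -> "doc_id") in several hundred JSON files. The files contain text scraped from the internet; the encoding should be UTF-8, but that may not always be the case. I read each file as binary, decode it as UTF-8, replace the text, and write it back to the file. The decoding step uses errors='replace', but I still get errors! Interestingly, if I run the program several times, it gets stuck on a different file each time.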

Why do I still get an error even with errors='replace'?

Here is the code:

import os

folder = 'C:\\some\\path'

for file in os.listdir(folder):
    if file.endswith('.json'):
        print('Processing: ', file)
        f = open(file, 'rb')
        binary_text = f.read()
        f.close()
        decoded_text = binary_text.decode(encoding='UTF-8', errors='replace')
        replaced_text = decoded_text.replace('"id":', '"doc_id":')
        f = open(file, 'w')
        f.write(replaced_text)
        f.close()

print('Done!')

Here is one example of the errors I get:

Traceback (most recent call last):
  File "C:\some\path\id_to_docid.py", line 13, in <module>
    binary_text = f.read()
  File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7423: character maps to <undefined>

Answer 1

OK, I seem to have found the problem. The second call, f = open(file, 'w'), should also be f = open(file, 'w', encoding='UTF-8', errors='replace'). Now everything runs smoothly. A rookie mistake, I guess...
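
The answer does not say why the write call matters. A plausible reason: when open() is called in text mode without an encoding argument, Python uses the platform's default encoding (cp1252 in the traceback above), and decoding with errors='replace' inserts U+FFFD replacement characters that cp1252 cannot represent. The following is a minimal sketch of the loop with that fix applied, not a definitive implementation. The function names replace_id_key and process_folder are hypothetical, and os.path.join and newline='' are additions that do not appear in the original post.

```python
import os


def replace_id_key(path):
    """Rewrite one file in place, renaming the "id" key to "doc_id"."""
    # Read raw bytes so that no implicit decoding happens on read.
    with open(path, 'rb') as f:
        binary_text = f.read()

    # Invalid UTF-8 bytes become U+FFFD instead of raising an error.
    decoded_text = binary_text.decode('utf-8', errors='replace')
    replaced_text = decoded_text.replace('"id":', '"doc_id":')

    # The fix from the answer: state the encoding explicitly on write.
    # newline='' is an assumption added here; it stops Python from
    # translating line endings that were read unchanged in binary mode.
    with open(path, 'w', encoding='utf-8', errors='replace', newline='') as f:
        f.write(replaced_text)


def process_folder(folder):
    """Apply replace_id_key to every .json file directly inside folder."""
    for name in os.listdir(folder):
        if name.endswith('.json'):
            print('Processing: ', name)
            # Join with the folder so the script does not depend on
            # the current working directory.
            replace_id_key(os.path.join(folder, name))
```

Calling process_folder('C:\\some\\path') would then process the folder from the question. To see the suspected failure in isolation, '\ufffd'.encode('cp1252') raises UnicodeEncodeError, while '\ufffd'.encode('utf-8') succeeds.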
