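Error when trying to write text containing HTML markup to a .txt file - Python

Time: 2019-04-10 13:24:14

Tags: python html file runtime-error writing

I get the following error when trying to write a dictionary value that contains HTML markup to a text file.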

Traceback (most recent call last):
  File "/Users/jackboland/PycharmProjects/NLTK_example/JsonToTxt.py", line 11, in <module>
    data = json.load(json_data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 10: invalid start byte

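I have a set of JSON files. I can successfully load their data into a Python dictionary. From there, I determine which dictionary key has the longest value and write that value to a text file. The code works for every JSON file whose longest value is a plain string. For files whose longest value is HTML content, it raises the error above.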

with open(path + file) as json_data:
    data = json.load(json_data)
    for value in data.values():  # gets the value of each dictionary key
        value = str(value)  # converts the value to a string so its characters can be counted
        vLength = len(value)  # length of the value, used to pick out only the longest values
        if vLength > 100:  # values over 100 characters are written out; this ensures the opinion text is captured no matter which key holds it
            f = open(newpath + file[:-5] + ".txt", 'w+')
            f.write(value)
            f.close()

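Values that are themselves dictionaries are written from the dictionary to the text file without problems. Only values that contain HTML are not written to the text file.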

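1 answer:

Answer 0: (score: 0)

Python is trying to convert a sequence of bytes into a Unicode string. While doing so, it encounters a byte sequence that is not allowed in a UTF-8 encoded string (here, 0xc0 at position 10).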
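The failure can be reproduced without any file at all. The snippet below is only a minimal sketch: the bytes in `raw` are a made-up sample, not the asker's data, arranged so that the byte 0xc0 sits at position 10. In UTF-8, 0xc0 can never start a valid sequence, so decoding stops there.

```python
# Hypothetical sample bytes: 0xc0 is the 11th byte (index 10).
raw = b'{"text": "\xc0 la carte"}'

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # keep the exception so it can be inspected

print(error)        # the same kind of message as in the traceback
print(error.start)  # index of the offending byte
```

The same bytes decode cleanly with a single-byte encoding such as cp1252, which suggests (but does not prove) that the original files were saved in an encoding other than UTF-8.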

Try reading the file in binary mode so that its content stays as bytes.

with open(path + file, 'rb') as json_data:
    data = json.load(json_data)
    # rest of the code
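One caveat worth checking: on Python 3.6 and later, `json.load` accepts a binary file but still expects the bytes to be UTF-8, UTF-16 or UTF-32, so binary mode alone may raise the same error. A possible workaround is to read the bytes and decode them explicitly. The sketch below assumes a hypothetical helper name, `load_json_with_fallback`, and assumes cp1252 as the fallback encoding; neither comes from the question.

```python
import json
import os
import tempfile


def load_json_with_fallback(file_path, encodings=("utf-8", "cp1252")):
    """Read the file as bytes, then try each candidate encoding in turn."""
    with open(file_path, "rb") as fh:
        raw = fh.read()
    for enc in encodings:
        try:
            return json.loads(raw.decode(enc))
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
    raise ValueError(f"none of {encodings} could decode {file_path}")


# Demo with a temporary file saved as cp1252 (an assumed encoding).
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.json")
    with open(sample, "wb") as fh:
        fh.write('{"html": "<p>À la carte</p>"}'.encode("cp1252"))
    data = load_json_with_fallback(sample)

print(data["html"])
```

Trying UTF-8 first keeps correctly encoded files working unchanged; the fallback only applies to files that fail to decode.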

If that does not work, specify the encoding manually.

Example:

with open(path + file, encoding="utf-8") as json_data:
    data = json.load(json_data)
    # rest of the code
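Since the traceback shows the file was already being decoded as UTF-8, a different encoding is probably needed in practice. The following is a sketch only: `SOURCE_ENCODING = "cp1252"` is a guess that has to be replaced with the files' real encoding, and the file names are invented for the demo. It also writes the output with an explicit encoding so that non-ASCII characters in the HTML survive the round trip.

```python
import json
import os
import tempfile

SOURCE_ENCODING = "cp1252"  # assumption: replace with the files' actual encoding

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "opinion.json")  # hypothetical input file
    dst = os.path.join(tmp, "opinion.txt")   # hypothetical output file

    # Create a sample input saved in the assumed source encoding.
    with open(src, "w", encoding=SOURCE_ENCODING) as fh:
        fh.write('{"id": 1, "html": "<p>À la carte menu</p>"}')

    # Read with the explicit encoding instead of the platform default.
    with open(src, encoding=SOURCE_ENCODING) as json_data:
        data = json.load(json_data)

    # Pick the longest value after converting each one to a string.
    longest = max((str(v) for v in data.values()), key=len)

    # Write as UTF-8 so every character can be represented.
    with open(dst, "w", encoding="utf-8") as out:
        out.write(longest)

    with open(dst, encoding="utf-8") as fh:
        written = fh.read()

print(written)
```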

The available encodings are listed here.

https://docs.python.org/2.4/lib/standard-encodings.html