Question

I know there are plenty of questions about encoding and decoding, but I can't seem to get my head around this one:

def content(title, sents):
    sent_elems = []
    for sent_i, sent in enumerate(sents, 1):
        elem = u"<a name=\"{i}\">[{i}]</a> <a href=\"#{i}\" id={i}>{text}</a>".format(i=sent_i, text=sent.text)
        sent_elems.append(elem)
    doc = u"""<html>
<head>
<title>{title}</title>
</head>
<body>{elems}</body>
</html>""".format(title=title, elems="\n".join(sent_elems))

    return doc

Calling the content function gives me this error in very rare cases (maybe once or twice across my whole dataset):

  File "processing.py", line 68, in score_summary
    self._write_config(references, summary)
  File "processing.py", line 56, in _write_config
    reference_files = self._write_references(references, reference_dir)
  File "processing.py", line 44, in _write_references
    f.write(rouge_summary_content(reference.id, reference.sents))
  File "processing.py", line 154, in rouge_summary_content
    </html>""".format(title=title, elems="\n".join(sent_elems))
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

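I changed the append to:

sent_elems.append(elem.decode("utf-8", "ignore"))

and also tried:

sent_elems.append(elem.decode("utf-8", "replace"))

Still the same error.

I looked at the data but couldn't figure out why this happens. I checked the files where the error occurs and still found no non-UTF-8 characters.

I also added this to my file: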

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

The problem persists. Any suggestions?

Answer 1

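My title was chr(65+index), so once the index ran past all the uppercase letters it started producing non-UTF-8 characters. I changed it to str(index), and that solved my original problem.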
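
To illustrate why this only surfaced on a handful of inputs, here is a minimal sketch that finds the first index at which such a title stops being valid UTF-8. The helper names `label_bytes` and `first_undecodable_index` are made up for this example, and `bytearray` is used to model the single-byte string that Python 2's chr() returns, so the sketch runs on both Python 2 and Python 3.

```python
def label_bytes(index):
    # Models Python 2's chr(65 + index): one raw byte, not a text character.
    return bytes(bytearray([65 + index]))


def first_undecodable_index(limit=191):
    # Hypothetical helper: scan indexes until the single-byte label
    # can no longer be decoded as UTF-8.
    for index in range(limit):
        try:
            label_bytes(index).decode("utf-8")
        except UnicodeDecodeError:
            return index
    return None


if __name__ == "__main__":
    print(first_undecodable_index())
```

Under these assumptions the scan stops at index 63, where 65 + 63 = 128 = 0x80. That is the same byte named in the traceback, and it would explain why only documents with more than 63 references hit the error. A title built with str(index) contains only ASCII digits, so it never reaches that range.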

Answer 2

If your data looks like this:

data = "0\x80\x06\t*\x86H\x86\xf7\r\x01\x07\x04\xa0\x800\x80\x02\x01\x01\x0e0\x0c\x06\b*\x86H\x86\xf7\r\x02\x05\x05....."

With the following approach, we can decode it as UTF-8:

import base64
import urllib

encoded = base64.b64encode(data)
decoded = urllib.unquote(encoded).decode('utf8')

The result looks like this:

MIAGCSqGSIb3DQEHAq...
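
A rough sketch of the same idea that also runs on Python 3 is shown here. Base64 output consists only of ASCII characters, so decoding it cannot fail. Note that this re-encodes the raw bytes into a text-safe form rather than recovering any text from them. The function names `to_ascii_text` and `from_ascii_text` are illustrative, the sample is just the first twelve bytes of the data shown in this answer, and the URL-unquoting step is left out.

```python
import base64


def to_ascii_text(raw):
    # Base64 maps arbitrary bytes onto an ASCII-only alphabet,
    # so the decode step below always succeeds.
    return base64.b64encode(raw).decode("ascii")


def from_ascii_text(text):
    # Reverse step: recover the original bytes from the Base64 text.
    return base64.b64decode(text.encode("ascii"))


if __name__ == "__main__":
    sample = b"0\x80\x06\t*\x86H\x86\xf7\r\x01\x07"
    print(to_ascii_text(sample))
```

The resulting string can be embedded safely in a unicode template, and `from_ascii_text` restores the original bytes when they are needed again.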

