Question

I'm writing a wrapper in Python 3. While testing it, I ran into a problem with HTML pages that are encoded in UTF-8:

Traceback (most recent call last):
  File "/home/caiocesare/PycharmProjects/script/xpathExtraction.py", line 121, in <module>
    xpaths_extraction()
  File "/home/caiocesare/PycharmProjects/script/xpathExtraction.py", line 117, in xpaths_extraction
    xpaths = get_xpaths(html_file)
  File "/home/caiocesare/PycharmProjects/script/xpathExtraction.py", line 21, in get_xpaths
    content = clean.clean_html(htmlFile)
  File "/home/caiocesare/PycharmProjects/script/htmlCleaner.py", line 8, in clean_html
    return clean_parsed_html(parsed_html)
  File "/home/caiocesare/PycharmProjects/script/htmlCleaner.py", line 24, in clean_parsed_html
    refactored_url = cleaner.clean_html(parsed_html)
  File "src/lxml/html/clean.py", line 520, in lxml.html.clean.Cleaner.clean_html
  File "src/lxml/html/clean.py", line 396, in lxml.html.clean.Cleaner.__call__
  File "/home/caiocesare/PycharmProjects/script/venv/lib/python3.6/site-packages/lxml/html/__init__.py", line 364, in drop_tag
    if self.text and isinstance(self.tag, basestring):
  File "src/lxml/etree.pyx", line 1014, in lxml.etree._Element.text.__get__
  File "src/lxml/apihelpers.pxi", line 670, in lxml.etree._collectText
  File "src/lxml/apihelpers.pxi", line 1405, in lxml.etree.funicode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 98: invalid start byte

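The odd thing is that the page declares UTF-8. All my URLs point to UTF-8 pages: the first 80,000 pages decoded without problems, and the decoder error appeared when I reached page 80,001. Why do I get this error, and is there a way to fix it?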

import lxml.html
from lxml.html.clean import Cleaner

def clean_html(url):
    # Parse the page at the given URL or file path.
    parsed_html = lxml.html.parse(url)
    return clean_parsed_html(parsed_html)

def clean_parsed_html(parsed_html):
    # Nothing to clean if the parser produced no root element.
    if parsed_html.getroot() is None:
        return ""

    cleaner = Cleaner()
    cleaner.javascript = True
    cleaner.style = True
    cleaner.kill_tags = ['head', 'script', 'header', 'href', 'footer', 'a']
    refactored_url = cleaner.clean_html(parsed_html)
    return lxml.html.tostring(refactored_url)

Answer 1

You can try the following to override the encoding of the document being parsed:

parsed_html = lxml.html.parse(url, parser=lxml.html.HTMLParser(encoding=CODEC))

For the placeholder CODEC, specify the document's actual encoding, e.g. "Latin-1".
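To see why a different codec can help, here is a minimal sketch using only the standard library. The byte 0xb1 is the one reported in the traceback; the surrounding text is made up for illustration, and Latin-1 is just an example guess, not a confirmed diagnosis of the failing page.

```python
# Byte 0xb1 as reported in the traceback; the surrounding text is made up.
raw = b"5 \xb1 2"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xb1 is a UTF-8 continuation byte, so it cannot start a character.
    print(exc.reason)

# Latin-1 maps every byte to a code point, so decoding never fails,
# but the result is only meaningful if the file really is Latin-1.
text = raw.decode("latin-1")
print(text)
```

Because Latin-1 accepts any byte sequence, a successful decode does not prove the guess was right; the output still needs a visual check.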

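This solution has two obvious drawbacks:

You need to find out the document's actual encoding. That means some trial and error: for example, open the document in a browser (preferably in source view) and change the encoding through the browser's menu (in Firefox, this is under View). If there are more documents like this, you will have to repeat the process.
You need to treat the problematic documents specially, as a kind of fallback. This is best done with a try/except construct that first tries to parse them according to their encoding declaration.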
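The fallback idea can be sketched as follows. Note that this is a different technique from overriding the parser's encoding: it decodes the raw bytes in Python first, which makes the failure happen in one predictable place. (In the traceback above, the error surfaces inside cleaner.clean_html rather than in lxml.html.parse, so a try/except around the parse call alone would not catch it.) The function name decode_with_fallback and the fallback list are assumptions chosen for illustration.

```python
def decode_with_fallback(raw, declared="utf-8", fallbacks=("cp1252", "latin-1")):
    """Return (text, encoding_used), trying the declared encoding first.

    `declared` and `fallbacks` are illustrative defaults, not a recommendation
    for any particular data set.
    """
    for encoding in (declared, *fallbacks):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate does not fit the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the input")


text, used = decode_with_fallback(b"5 \xb1 2")
print(used, text)
```

The decoded string could then be handed to an HTML parser. Keeping track of which encoding was used makes it easier to review the documents that needed the fallback.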

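So this is not a fully automatic solution. Since the input is broken, there is not much you can do: no magic tool can recover from every mistake made by other software, and a decent heuristic is the best you can hope for. If you have a lot of broken input and you don't particularly care about it, just skip those documents so that you can process the sane part.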
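Skipping broken documents might look like the sketch below. The helper clean_all and the in-memory pages dictionary are hypothetical stand-ins; in the real script, the clean argument would be whatever function runs the parsing and cleaning for one page.

```python
def clean_all(sources, clean):
    """Run `clean` on every source, collecting failures instead of aborting."""
    cleaned = {}
    skipped = []
    for source in sources:
        try:
            cleaned[source] = clean(source)
        except UnicodeDecodeError as exc:
            # Remember what failed so it can be inspected or retried later.
            skipped.append((source, exc.reason))
    return cleaned, skipped


# Stand-in for the real pipeline: two in-memory "pages", one of them broken.
pages = {"good.html": b"caf\xc3\xa9", "bad.html": b"5 \xb1 2"}
cleaned, skipped = clean_all(pages, lambda name: pages[name].decode("utf-8"))
print(cleaned)
print(skipped)
```

Recording the skipped sources, rather than silently dropping them, leaves the option of revisiting them later with an explicit encoding.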
