I am writing a wrapper in Python 3. While testing it, I ran into a problem with some HTML pages that are encoded in UTF-8:
Traceback (most recent call last):
File "/home/caiocesare/PycharmProjects/script/xpathExtraction.py", line 121, in <module>
xpaths_extraction()
File "/home/caiocesare/PycharmProjects/script/xpathExtraction.py", line 117, in xpaths_extraction
xpaths = get_xpaths(html_file)
File "/home/caiocesare/PycharmProjects/script/xpathExtraction.py", line 21, in get_xpaths
content = clean.clean_html(htmlFile)
File "/home/caiocesare/PycharmProjects/script/htmlCleaner.py", line 8, in clean_html
return clean_parsed_html(parsed_html)
File "/home/caiocesare/PycharmProjects/script/htmlCleaner.py", line 24, in clean_parsed_html
refactored_url = cleaner.clean_html(parsed_html)
File "src/lxml/html/clean.py", line 520, in lxml.html.clean.Cleaner.clean_html
File "src/lxml/html/clean.py", line 396, in lxml.html.clean.Cleaner.__call__
File "/home/caiocesare/PycharmProjects/script/venv/lib/python3.6/site-packages/lxml/html/__init__.py", line 364, in drop_tag
if self.text and isinstance(self.tag, basestring):
File "src/lxml/etree.pyx", line 1014, in lxml.etree._Element.text.__get__
File "src/lxml/apihelpers.pxi", line 670, in lxml.etree._collectText
File "src/lxml/apihelpers.pxi", line 1405, in lxml.etree.funicode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 98: invalid start byte
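The funny thing is that the page does use UTF-8. All of my URLs are UTF-8: the first 80,000 pages were processed without any trouble, and the decoder problem only started with page 80,001. Why do I get this error, and is there a way to fix it?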
import lxml.html
from lxml.html.clean import Cleaner


def clean_html(url):
    # Parse the page from a URL or file path.
    parsed_html = lxml.html.parse(url)
    return clean_parsed_html(parsed_html)


def clean_parsed_html(parsed_html):
    # An empty document has no root element; return an empty result for it.
    if parsed_html.getroot() is None:
        return ""
    cleaner = Cleaner()
    cleaner.javascript = True
    cleaner.style = True
    cleaner.kill_tags = ['head', 'script', 'header', 'href', 'footer', 'a']
    refactored_url = cleaner.clean_html(parsed_html)
    return lxml.html.tostring(refactored_url)
Answer 0 (score: 0)
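You can try the following to override the encoding of the document being parsed:
parsed_html = lxml.html.parse(url, parser=lxml.html.HTMLParser(encoding=CODEC))
For the placeholder CODEC, specify the document's actual encoding, e.g. "Latin-1".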
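To see why the choice of codec matters, here is a minimal sketch that uses only the standard library. The sample bytes are made up for illustration and are not taken from the page in the question. Byte 0xb1 can never start a UTF-8 sequence, which is what the traceback complains about, whereas Latin-1 assigns a character to every byte value.

```python
# Hypothetical sample: markup that is supposed to be UTF-8 but contains a stray 0xb1 byte.
raw = b"<p>tolerance \xb1 5%</p>"


def try_decode(data, codec):
    """Return the decoded text, or None if the bytes are not valid in this codec."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# 0xb1 is a UTF-8 continuation byte, so it is invalid as the start of a character.
as_utf8 = try_decode(raw, "utf-8")
# Latin-1 maps every byte to a code point; 0xb1 becomes the plus-minus sign.
as_latin1 = try_decode(raw, "latin-1")

print(as_utf8)
print(as_latin1)
```

Because Latin-1 never fails, a successful decode does not prove that Latin-1 was the right guess. It only means the text should be checked by eye.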
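This solution has two obvious drawbacks. In particular, the override should only be applied to the broken documents, for example with a try/except construct that first tries to parse each document according to its encoding declaration. So it is not a fully automatic solution.
Since the input is broken, there is not much you can do. No magic tool can recover from arbitrary errors introduced by other software; the best you can hope for is a decent heuristic.
If you have a lot of broken inputs and do not particularly care about them, just skip them so that you can process the sane part.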
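The try/except idea and the "skip the broken ones" advice can be sketched roughly as follows. This is an illustration under assumptions, not the asker's pipeline: `decode_page`, the `pages` dictionary and the fallback list are hypothetical names, and the sketch works on raw bytes with the standard library only. The decoded string could then be handed to a parser such as `lxml.html.fromstring`.

```python
def decode_page(raw, declared="utf-8", fallbacks=()):
    """Try the declared encoding first, then each fallback in order.

    Returns the decoded text, or None to signal that the page should be skipped.
    """
    for codec in (declared, *fallbacks):
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    return None


# Hypothetical inputs standing in for downloaded pages.
pages = {
    "good.html": "<p>caf\u00e9</p>".encode("utf-8"),
    "broken.html": b"<p>caf\xb1</p>",  # declared UTF-8, but contains a stray byte
}

# Strict mode: no fallback, so broken pages come back as None and get skipped.
strict = {name: decode_page(raw) for name, raw in pages.items()}
skipped = [name for name, text in strict.items() if text is None]

# Lenient mode: fall back to Latin-1, a heuristic that always yields some text.
lenient = {name: decode_page(raw, fallbacks=("latin-1",)) for name, raw in pages.items()}

print(skipped)
```

Strict mode matches the "skip them" advice, and lenient mode matches the encoding override. Which one fits depends on how much the broken pages matter.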