Disabling the '--' comment check in lxml

Date: 2016-01-04 16:14:46

Tags: python web-scraping lxml html5lib

Use case:

Parsing https://www.banca-romaneasca.ro/en/tools-and-resources/ with html5lib and the lxml tree builder fails:

...
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/html5parser.py:468: in processComment
    self.tree.insertComment(token, self.tree.openElements[-1])
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree_lxml.py:312: in insertCommentMain
    super(TreeBuilder, self).insertComment(data, parent)
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/_base.py:262: in insertComment
    parent.appendChild(self.commentClass(token["data"]))
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree.py:148: in __init__
    self._element = ElementTree.Comment(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

- src/lxml/lxml.etree.pyx:3017: ValueError: Comment may not contain '--' or end with '-'

The error is raised by lxml: https://github.com/lxml/lxml/blob/master/src/lxml/lxml.etree.pyx#L3017

The offending comment, found in https://www.banca-romaneasca.ro/en/tools-and-resources/:
...
<script type="text/javascript" src="/_res/js/forms.js"></script>

<!-- Google Code for Remarketing Tag -->
<!--------------------------------------------------
Remarketing tags may not be associated with personally identifiable information or placed on pages related to sensitive categories. See more information and instructions on how to setup the tag on: http://google.com/ads/remarketingsetup
--------------------------------------------------->
<script type="text/javascript">
/* <![CDATA[ */
var google_conversion_id = 958631629;
var google_custom_params = window.google_tag_params;
... 
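To check which comments on a page would trip the lxml check, a rough scan of the raw markup can help. The sketch below uses a regular expression, which only approximates how a real HTML parser tokenizes comments, and `find_offending_comments` is a hypothetical helper name, not part of lxml or html5lib:

```python
import re

# Matches an HTML comment; the body is captured lazily up to the first "-->".
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def find_offending_comments(html):
    """Return comment bodies that contain '--' or end with '-'."""
    return [
        body
        for body in COMMENT_RE.findall(html)
        if "--" in body or body.endswith("-")
    ]


# Shortened stand-in for the two comments shown above.
sample = (
    "<!-- Google Code for Remarketing Tag -->\n"
    "<!------\nRemarketing tags ...\n------->\n"
)
offending = find_offending_comments(sample)
print(len(offending))
```

Only the second comment is reported, because its body begins and ends with runs of dashes.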

Requested solutions:

  • Disable the check (some magic or a flag in lxml)

    if b'--' in text or text.endswith(b'-'):
        raise ValueError("Comment may not contain '--' or end with '-'")
    
  • Monkey patching (change the code, inject ...)
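A third option, short of disabling the check, would be to rewrite the comment text so that it passes before it reaches lxml. A minimal sketch, assuming the text is available as a `str` (the lxml check quoted above operates on bytes); `coerce_comment` is a hypothetical name, not an lxml or html5lib API, and note that it changes the comment's content:

```python
def coerce_comment(data):
    """Rewrite comment text so it has no '--' and does not end with '-'."""
    # Repeat until no adjacent dashes remain; one pass is not enough
    # for runs of three or more dashes.
    while "--" in data:
        data = data.replace("--", "- -")
    # A trailing dash would also be rejected, so pad it with a space.
    if data.endswith("-"):
        data += " "
    return data


raw = "----\nRemarketing tags ...\n-----"
safe = coerce_comment(raw)
print(repr(safe))
```

A monkey patch could call such a function wherever the comment node is built, at the cost of depending on library internals.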

Update 1:

I use html5lib because I want to get tags such as audio, section, video, etc., which are available in HTML5.

import html5lib
from lxml.html import html5parser, fromstring

context = fromstring(document.content)  # works
context = html5parser.fromstring(document.content)  # does not work

context = html5lib.parse(  # does not work
    document.content,
    treebuilder="lxml",
    namespaceHTMLElements=document.namespace,
    encoding=document.encoding
)

Versions:

  • html5lib == 0.9999999
  • lxml == 3.5.0 (downgrading lxml is not a solution either)

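Update 2:

This seems to be an improvement/issue in lxml: https://github.com/lxml/lxml/pull/172#issuecomment-169084439

Waiting for feedback from the lxml developers.

Update 3:

Got feedback: it appears to be an html5lib bug that has already been fixed in the latest development version on GitHub.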

2 answers:

Answer 0: (score: 2)

Solution found, based on a comment from @opottone on GitHub:

I tried installing the latest html5parser from GitHub. Now I only get a warning, not an error.

Answer 1: (score: 1)

Since the data you are trying to parse is HTML, use lxml.html instead of lxml.etree.

Works for me:

>>> import requests
>>> import lxml.html
>>> 
>>> data = requests.get("https://www.banca-romaneasca.ro/en/tools-and-resources/").content
>>> tree = lxml.html.fromstring(data)
>>> tree.xpath("//title/text()")
['Tools and resources - Banca Romaneasca']