XPath strips HTML tags?

Time: 2016-05-17 22:12:48

Tags: python python-2.7 xpath

I am using the following in BeautifulSoup to get the text from https://alas.aws.amazon.com/ALAS-2015-530.html:
description = " ".join(xpath_parse(tree, '//div[@id="issue_overview"]/p/text()')).replace('. ()', '.\n')

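However, the content comes back with all HTML tags stripped. I get: "As discussed in , Ruby's OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as ."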

My xpath_parse is simple:

    def xpath_parse(tree, xfilter):
        return tree.xpath(xfilter)

Can someone tell me why this happens?

1 Answer:

Answer 0: (score: 2)

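Because of the /text() part, the expression selects only the text nodes that are direct children of //div[@id="issue_overview"]/p. Text inside child elements, such as the a links, is skipped, which is why the link text is missing from your result.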
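To make the difference concrete, here is a small sketch that runs offline. The SAMPLE markup is a hand-written stand-in for the page (an assumption, not the real ALAS markup). It compares /text(), which selects only the direct text-node children of p, with //text(), which selects text nodes at any depth below p.

```python
from lxml.html import fromstring

# Hypothetical stand-in for the downloaded page; the real markup is larger.
SAMPLE = (
    '<html><body><div id="issue_overview"><b>Issue Overview:</b>'
    '<p>As discussed in <a href="#">an upstream announcement</a>, '
    'a bug such as <a href="#">CVE-2014-1492</a>.</p>'
    '</div></body></html>'
)

tree = fromstring(SAMPLE)

# Direct text-node children of <p> only: the link text is not included.
direct_text = tree.xpath('//div[@id="issue_overview"]/p/text()')

# Text nodes at any depth under <p>: the link text is included.
all_text = tree.xpath('//div[@id="issue_overview"]/p//text()')

print("".join(direct_text))
print("".join(all_text))
```

The first print shows the sentence with holes where the links were, which matches the symptom in the question; the second shows the full sentence.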

Instead, assuming you are using the lxml.html package, use the .text_content() method:

  

"Returns the text content of the element, including the text content of its children, with no markup."

tree.xpath('//div[@id="issue_overview"]')[0].text_content()

Demo:

>>> from lxml.html import fromstring
>>> import requests
>>>
>>> url = "https://alas.aws.amazon.com/ALAS-2015-530.html"
>>> response = requests.get(url)
>>> root = fromstring(response.content)
>>> overview = root.xpath('//div[@id="issue_overview"]')[0].text_content().replace("Issue Overview:", "").strip()
>>> print(overview)
As discussed in an upstream announcement, Ruby's OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as CVE-2014-1492 .
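The demo indexes the XPath result with [0], which raises IndexError when the div is missing, and the returned text keeps whatever newlines and indentation the page contains. A minimal sketch of a more defensive variant follows. The helper name overview_text and the SAMPLE markup are hypothetical, introduced only for illustration.

```python
from lxml.html import fromstring

# Hypothetical markup standing in for the downloaded page.
SAMPLE = (
    '<html><body><div id="issue_overview"><b>Issue Overview:</b>\n'
    '<p>As discussed in <a href="#">an upstream announcement</a>,\n'
    '   a bug such as <a href="#">CVE-2014-1492 <i></i></a>.</p>'
    '</div></body></html>'
)


def overview_text(root):
    """Return the whitespace-normalized overview text, or None if the div is absent."""
    matches = root.xpath('//div[@id="issue_overview"]')
    if not matches:
        return None
    text = matches[0].text_content().replace("Issue Overview:", "")
    # split() followed by join() collapses newlines and runs of spaces.
    return " ".join(text.split())


print(overview_text(fromstring(SAMPLE)))
print(overview_text(fromstring('<html><body><p>no overview</p></body></html>')))
```

Normalizing with split() and join() is a design choice that suits single-paragraph text; it would also flatten intentional line breaks, so skip it if the paragraph structure matters.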

Or, if you need the element's markup, use the tostring() method:

>>> from lxml.html import fromstring, tostring
>>> tostring(root.xpath('//div[@id="issue_overview"]/p')[0])
'<p>As discussed in <a href="https://www.ruby-lang.org/en/news/2015/04/13/ruby-openssl-hostname-matching-vulnerability/">an upstream announcement</a>, Ruby\'s OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as <a href="https://access.redhat.com/security/cve/CVE-2014-1492" target="_blank">CVE-2014-1492 <i class="icon-external-link"></i></a>.</p>\n            '

And after removing the i elements:

>>> overview = root.xpath('//div[@id="issue_overview"]/p')[0]
>>> for i in overview.xpath(".//i"):
...     i.getparent().remove(i)
... 
>>> tostring(overview)
'<p>As discussed in <a href="https://www.ruby-lang.org/en/news/2015/04/13/ruby-openssl-hostname-matching-vulnerability/">an upstream announcement</a>, Ruby\'s OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as <a href="https://access.redhat.com/security/cve/CVE-2014-1492" target="_blank">CVE-2014-1492 </a>.</p>\n            '