Question

I am using the following in BeautifulSoup to get the text from https://alas.aws.amazon.com/ALAS-2015-530.html:

description = " ".join(xpath_parse(tree, '//div[@id="issue_overview"]/p/text()')).replace('. ()', '.\n')

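However, the content comes back stripped of all HTML tags, and the text inside those tags is missing too. I get: "As discussed in , Ruby's OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as ."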

My xpath_parse is simple:

    def xpath_parse(tree, xfilter):
        return tree.xpath(xfilter)

Can someone tell me why this is happening?

Answer 1

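Because of the /text() part, the expression returns only the text nodes located directly under /div[@id="issue_overview"]/p. Text that sits inside child elements, such as the a links, is not included.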
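To illustrate the difference, here is a minimal sketch using lxml.html. The HTML string is a shortened, hypothetical stand-in for the advisory page, not its real markup. It compares /text() (direct text nodes only) with //text() (text nodes at any depth):

```python
from lxml.html import fromstring

# Hypothetical snippet that mimics the structure of the advisory page.
html = (
    '<div id="issue_overview"><p>As discussed in '
    '<a href="#">an upstream announcement</a>, see '
    '<a href="#">CVE-2014-1492</a>.</p></div>'
)
root = fromstring(html)

# /text() selects only the text nodes that are direct children of <p>.
direct = root.xpath('//div[@id="issue_overview"]/p/text()')

# //text() selects text nodes at any depth, including those inside <a>.
all_text = root.xpath('//div[@id="issue_overview"]/p//text()')

print(direct)
print("".join(all_text))
```

With /text(), the link text is skipped, which matches the gaps in the output quoted in the question. Joining the //text() results with an empty string gives the full sentence.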

Instead, assuming you are using the lxml.html package, use the .text_content() method:

Returns the text content of the element, including the text content of its children, with no markup.

tree.xpath('//div[@id="issue_overview"]')[0].text_content()

Demo:

>>> from lxml.html import fromstring
>>> import requests
>>>
>>> url = "https://alas.aws.amazon.com/ALAS-2015-530.html"
>>> response = requests.get(url)
>>> root = fromstring(response.content)
>>> overview = root.xpath('//div[@id="issue_overview"]')[0].text_content().replace("Issue Overview:", "").strip()
>>> print(overview)
As discussed in an upstream announcement, Ruby's OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as CVE-2014-1492 .

Alternatively, if you need the element's markup, use the tostring() method:

>>> from lxml.html import fromstring, tostring
>>> tostring(root.xpath('//div[@id="issue_overview"]/p')[0])
'<p>As discussed in <a href="https://www.ruby-lang.org/en/news/2015/04/13/ruby-openssl-hostname-matching-vulnerability/">an upstream announcement</a>, Ruby\'s OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as <a href="https://access.redhat.com/security/cve/CVE-2014-1492" target="_blank">CVE-2014-1492 <i class="icon-external-link"></i></a>.</p>\n            '

And after removing the i elements:

>>> overview = root.xpath('//div[@id="issue_overview"]/p')[0]
>>> for i in overview.xpath(".//i"):
...     i.getparent().remove(i)
... 
>>> tostring(overview)
'<p>As discussed in <a href="https://www.ruby-lang.org/en/news/2015/04/13/ruby-openssl-hostname-matching-vulnerability/">an upstream announcement</a>, Ruby\'s OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as <a href="https://access.redhat.com/security/cve/CVE-2014-1492" target="_blank">CVE-2014-1492 </a>.</p>\n            '
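One caveat about removing elements this way: in lxml, getparent().remove(el) also discards the element's tail text (the text that follows its closing tag). On the page above the i element has no tail, so nothing is lost. The sketch below uses a hypothetical snippet where the i element is followed by text, and compares remove() with the drop_tree() method of lxml.html elements, which keeps the tail:

```python
from lxml.html import fromstring, tostring

# Hypothetical markup: the <i> element is followed by tail text "(external)".
html = (
    '<p>See <a href="#">CVE-2014-1492 '
    '<i class="icon-external-link"></i>(external)</a>.</p>'
)

# remove() deletes the element together with its tail text.
p1 = fromstring(html)
for i in p1.xpath(".//i"):
    i.getparent().remove(i)
removed = tostring(p1, encoding="unicode")

# drop_tree() deletes the element but merges its tail into the preceding text.
p2 = fromstring(html)
for i in p2.xpath(".//i"):
    i.drop_tree()
dropped = tostring(p2, encoding="unicode")

print(removed)
print(dropped)
```

If the elements you strip may be followed by text you want to keep, drop_tree() is probably the safer choice.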

XPath stripping HTML tags?
