I'm using the following in BeautifulSoup to get the text from https://alas.aws.amazon.com/ALAS-2015-530.html:
description = " ".join(xpath_parse(tree, '//div[@id="issue_overview"]/p/text()')).replace('. ()', '.\n')
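However, the content comes back stripped of all the HTML tags, along with the text inside them. I get: "As discussed in , Ruby's OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as ."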
My xpath_parse is simple:
def xpath_parse(tree, xfilter):
    return tree.xpath(xfilter)
Can someone tell me why this happens?
Answer 0 (score: 2)
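Because of the /text() part, the expression selects only the text nodes located directly under //div[@id="issue_overview"]/p. Text inside child elements, such as the a links, is not matched.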
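To make the difference concrete, here is a minimal sketch that compares /text() with //text(). The HTML snippet is made up to mimic the shape of the advisory paragraph; it is not the real page content.

```python
from lxml.html import fromstring

# Hypothetical snippet that mimics the structure of the advisory page.
html = (
    '<div id="issue_overview"><p>As discussed in '
    '<a href="#">an upstream announcement</a>, see '
    '<a href="#">CVE-2014-1492</a>.</p></div>'
)
tree = fromstring(html)

# /text() matches only text nodes that are direct children of <p>,
# so the text inside the <a> elements is skipped.
direct = tree.xpath('//div[@id="issue_overview"]/p/text()')

# //text() matches text nodes at any depth below <p>.
nested = tree.xpath('//div[@id="issue_overview"]/p//text()')

print(direct)
print("".join(nested))
```

The first list holds only the fragments between the links, which is why the joined string in the question has gaps where the link text should be.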
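Instead, assuming you are using the lxml.html package, use the .text_content() method:
Returns the text content of the element, including the text content of its children, with no markup.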
tree.xpath('//div[@id="issue_overview"]')[0].text_content()
Demo:
>>> from lxml.html import fromstring
>>> import requests
>>>
>>> url = "https://alas.aws.amazon.com/ALAS-2015-530.html"
>>> response = requests.get(url)
>>> root = fromstring(response.content)
>>> overview = root.xpath('//div[@id="issue_overview"]')[0].text_content().replace("Issue Overview:", "").strip()
>>> print(overview)
As discussed in an upstream announcement, Ruby's OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as CVE-2014-1492 .
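The demo strips the "Issue Overview:" label with replace(). Another option, sketched here on made-up markup, is to call text_content() on the p elements only, so the label never enters the result. This assumes the label sits outside the p elements; the b tag used for it below is a guess, not taken from the real page.

```python
from lxml.html import fromstring

# Hypothetical markup; the real page layout may differ.
html = (
    '<div id="issue_overview"><b>Issue Overview:</b>'
    '<p>As discussed in <a href="#">an upstream announcement</a>, '
    'hostnames are matched too permissively.</p></div>'
)
root = fromstring(html)

# Select only the paragraphs, then collect each one's full text,
# including the text of nested elements such as <a>.
paragraphs = root.xpath('//div[@id="issue_overview"]/p')
overview = " ".join(p.text_content().strip() for p in paragraphs)
print(overview)
```

This also handles an overview that spans several paragraphs, since each p contributes its own text.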
Or, if you need to get the markup of the element, use the tostring() method:
>>> from lxml.html import fromstring, tostring
>>> tostring(root.xpath('//div[@id="issue_overview"]/p')[0])
'<p>As discussed in <a href="https://www.ruby-lang.org/en/news/2015/04/13/ruby-openssl-hostname-matching-vulnerability/">an upstream announcement</a>, Ruby\'s OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as <a href="https://access.redhat.com/security/cve/CVE-2014-1492" target="_blank">CVE-2014-1492 <i class="icon-external-link"></i></a>.</p>\n '
And, after removing the i elements:
>>> overview = root.xpath('//div[@id="issue_overview"]/p')[0]
>>> for i in overview.xpath(".//i"):
... i.getparent().remove(i)
...
>>> tostring(overview)
'<p>As discussed in <a href="https://www.ruby-lang.org/en/news/2015/04/13/ruby-openssl-hostname-matching-vulnerability/">an upstream announcement</a>, Ruby\'s OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as <a href="https://access.redhat.com/security/cve/CVE-2014-1492" target="_blank">CVE-2014-1492 </a>.</p>\n '
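One caveat with getparent().remove(): in lxml it also discards the element's tail, meaning any text that follows the closing tag. The i elements above have no tail, so nothing is lost there. If they did, the drop_tree() method of lxml.html elements would keep that text. A small sketch on a made-up fragment:

```python
from lxml.html import fromstring, tostring

# Hypothetical fragment where text follows the <i> element.
html = '<p>See <a href="#">CVE-2014-1492 <i class="icon"></i>(external)</a>.</p>'

# remove(): the text after </i> (its tail) is deleted along with the element.
p1 = fromstring(html)
for i in p1.xpath(".//i"):
    i.getparent().remove(i)

# drop_tree(): the element goes away, but its tail text is kept.
p2 = fromstring(html)
for i in p2.xpath(".//i"):
    i.drop_tree()

removed = tostring(p1, encoding="unicode")
dropped = tostring(p2, encoding="unicode")
print(removed)
print(dropped)
```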