Removing information from an XPath result?

Time: 2016-05-06 17:16:42

Tags: python python-2.7 xpath html-parsing

I use the following line of code to get the CVE IDs from a web page:

  project.cve_information = "".join(xpath_parse(tree, '//div[@id="references"]/a/text()')).split()

But the problem is that the markup looks like this:

            <div id='references'>
            <b>References:</b>
            <a href='https://access.redhat.com/security/cve/CVE-2011-3256' target='_blank'>CVE-2011-3256&nbsp;<i class='icon-external-link'></i></a>
            <a href='https://rhn.redhat.com/errata/RHSA-2011-1402.html' target='_blank'>RHSA-2011-1402&nbsp;<i class='icon-external-link'></i></a><br />
        </div>

References: CVE-xxxx-xxxx RHSA-xxxx-xxxx

How can I keep RHSA and similar entries from being parsed? I only want the CVE-xxxx-xxxx values. I use them to submit a form like this:

          "form[CVEID]" : ",".join(self.cve_information) if self.cve_information else "GENERIC-MAP-NOMATCH",

The form validates CVE values only, so it returns an error because my code tends to include RHSA values.

1 answer:

Answer 0 (score: 1)

You can use contains() to select only the links whose href includes "CVE":

h = """ <div id='references'>
            <b>References:</b>
            <a href='https://access.redhat.com/security/cve/CVE-2011-3256' target='_blank'>CVE-2011-3256&nbsp;<i class='icon-external-link'></i></a>
            <a href='https://rhn.redhat.com/errata/RHSA-2011-1402.html' target='_blank'>RHSA-2011-1402&nbsp;<i class='icon-external-link'></i></a><br />
        </div>"""

from lxml import html

xml = html.fromstring(h)

urls = xml.xpath('//div[@id="references"]/a[contains(@href, "CVE")]/@href')

Or, if you want to ignore the hrefs that include RHSA, you can use not(contains()):

urls = xml.xpath('//div[@id="references"]/a[not(contains(@href, "RHSA"))]/@href')

Both will give you:

 ['https://access.redhat.com/security/cve/CVE-2011-3256']
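Both expressions above return the href, while the question asks for the CVE ID itself. A minimal sketch of one way to get the IDs is to apply the same contains() predicate but select text() instead of @href, then clean the result with a regular expression. The helper name extract_cve_ids and the pattern CVE_PATTERN are hypothetical, and the pattern assumes IDs shaped like "CVE-", a four-digit year, and four or more digits:

```python
import re

from lxml import html

# Same markup as in the question.
SAMPLE = """<div id='references'>
    <b>References:</b>
    <a href='https://access.redhat.com/security/cve/CVE-2011-3256' target='_blank'>CVE-2011-3256&nbsp;<i class='icon-external-link'></i></a>
    <a href='https://rhn.redhat.com/errata/RHSA-2011-1402.html' target='_blank'>RHSA-2011-1402&nbsp;<i class='icon-external-link'></i></a><br />
</div>"""

# Assumption: a CVE ID is "CVE-", a four-digit year, then four or more digits.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


def extract_cve_ids(markup):
    """Return the CVE IDs shown as link text inside div#references (hypothetical helper)."""
    tree = html.fromstring(markup)
    # Keep only links whose href mentions "CVE", and take their text nodes.
    texts = tree.xpath('//div[@id="references"]/a[contains(@href, "CVE")]/text()')
    # The link text ends with a non-breaking space (&nbsp;); the regex drops it
    # and discards anything that is not shaped like a CVE ID.
    return CVE_PATTERN.findall(" ".join(texts))


cve_ids = extract_cve_ids(SAMPLE)
print(cve_ids)
```

The resulting list could then be passed to ",".join(...) as in the form code from the question. The regex step is optional if the XPath filter alone is trusted, but it also guards against stray whitespace in the link text.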