Question

I am trying to query some HTML for links that contain the word "download" in some way. It could appear in:

the id
the class
the href
the text
any HTML inside the a tag.

So, using the Python lxml library, it should find all 7 links in the test HTML:

html = """
<html>
<head></head>
<body>
1 <a href="/test1" id="download">test 1</a>
2 <a href="/test2" class="download">test 2</a>
3 <a href="/download">test 3</a>
4 <a href="/test4">DoWnLoAd</a>
5 <a href="/test5">ascascDoWnLoAdsacsa</a>
6 <a href="/test6"><div id="test6">download</div></a>
7 <a href="/test7"><div id="download">test7</div></a>
</body>
</html>
"""

from lxml import etree

tree = etree.fromstring(html, etree.HTMLParser())
downloadElementConditions = "//a[(@id|@class|@href|text())[contains(translate(.,'DOWNLOAD','download'), 'download')]]"
elements = tree.xpath(downloadElementConditions)

print('FOUND ELEMENTS:', len(elements))
for i in elements:
    print(i.get('href'), i.text)

Run as is, it only finds the first five elements. This means XPath only finds "download" in the text when the text of the a tag contains no further HTML.

Is there a way to treat the content of the a tag as a regular string and check whether it contains "download"? All tips are welcome!

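[Edit]

Using the tip from heinst's answer below, I edited the code as follows. This now works, but it is not very elegant. Does anyone know of a solution in pure XPath?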

from lxml import etree
tree = etree.fromstring(html, etree.HTMLParser())
downloadElementConditions = "//*[(@id|@class|@href|text())[contains(translate(.,'DOWNLOAD','download'), 'download')]]"
elements = tree.xpath(downloadElementConditions)

print('FOUND ELEMENTS:', len(elements))
for el in elements:
    href = el.get('href')
    if href:
        print(href, el.text)
    else:
        elparent = el
        for _ in range(10):  # walk up at most 10 ancestors
            elparent = elparent.getparent()
            if elparent is None:  # reached the root without finding an href
                break
            href = elparent.get('href')
            if href:
                print(href, elparent.text)
                break

Answer 1

Pure XPath solution

Change text() to . and search for the attributes on the descendant-or-self axis:

//a[(.|.//@id|.//@class|.//@href)[contains(translate(.,'DOWNLOAD','download'),'download')]]
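
As a rough check, the expression above can be run against the question's test HTML with lxml. This is only a sketch: the variable names expr, links and hrefs are chosen here for illustration, and the leading numbers from the test HTML are left out for brevity.

```python
from lxml import etree

# Test HTML from the question (leading numbers omitted)
html = """
<html><body>
<a href="/test1" id="download">test 1</a>
<a href="/test2" class="download">test 2</a>
<a href="/download">test 3</a>
<a href="/test4">DoWnLoAd</a>
<a href="/test5">ascascDoWnLoAdsacsa</a>
<a href="/test6"><div id="test6">download</div></a>
<a href="/test7"><div id="download">test7</div></a>
</body></html>
"""

tree = etree.fromstring(html, etree.HTMLParser())

# "." is the string value of <a>; ".//@id" etc. also reach attributes of descendants
expr = ("//a[(.|.//@id|.//@class|.//@href)"
        "[contains(translate(., 'DOWNLOAD', 'download'), 'download')]]")

links = tree.xpath(expr)
hrefs = [a.get("href") for a in links]

print("FOUND ELEMENTS:", len(links))
for a in links:
    print(a.get("href"))
```

Because the predicate sits on a itself, the result is the list of link elements, so no walk up through parents is needed afterwards.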

Explanation:

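text() vs . : Here text() matches only the text nodes that are immediate children of a; . matches the string value of the a element. To catch the cases where a child element of a contains the target text, you want to match against the string value of a.
descendant-or-self : To match attributes of a or of any of its descendants, the descendant-or-self axis (.//) is used.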
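
The difference between text() and the string value can be seen on a single link. The following is a minimal sketch using link 6 from the question; the names snippet, direct_text and string_value are made up for this illustration.

```python
from lxml import etree

# Link 6 from the question: the word sits inside a child <div>, not directly in <a>
snippet = '<a href="/test6"><div id="test6">download</div></a>'
a = etree.fromstring(snippet)

# text() selects only text nodes that are direct children of <a>
direct_text = a.xpath("text()")

# string(.) concatenates the text of <a> and all of its descendants
string_value = a.xpath("string(.)")

print(direct_text)   # []
print(string_value)  # download
```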

For more details on string values in XPath, see "Matching text nodes is different than matching string values".

Answer 2

Changing the XPath selector from strictly matching a tags to a wildcard should do the trick: "//*[(@id|@class|@href|text())[contains(translate(.,'DOWNLOAD','download'), 'download')]]"
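
Note that the wildcard matches the nested div elements rather than the links that contain them, so each hit still has to be mapped back to its a element. One possible way to do that, which is not part of this answer, is the ancestor-or-self axis. The sketch below assumes the question's test HTML; wildcard, links and hrefs are illustrative names.

```python
from lxml import etree

# Test HTML from the question (leading numbers omitted)
html = """
<html><body>
<a href="/test1" id="download">test 1</a>
<a href="/test2" class="download">test 2</a>
<a href="/download">test 3</a>
<a href="/test4">DoWnLoAd</a>
<a href="/test5">ascascDoWnLoAdsacsa</a>
<a href="/test6"><div id="test6">download</div></a>
<a href="/test7"><div id="download">test7</div></a>
</body></html>
"""

tree = etree.fromstring(html, etree.HTMLParser())

wildcard = ("//*[(@id|@class|@href|text())"
            "[contains(translate(., 'DOWNLOAD', 'download'), 'download')]]")

links = []
for el in tree.xpath(wildcard):
    # ancestor-or-self::a[1] is the element itself if it is an <a>,
    # otherwise the nearest enclosing <a> (empty if there is none)
    for a in el.xpath("ancestor-or-self::a[1]"):
        if a not in links:  # avoid listing a link twice
            links.append(a)

hrefs = [a.get("href") for a in links]
print("FOUND ELEMENTS:", len(links))
print(hrefs)
```

This replaces the fixed loop over ten parents with a single axis step, at the cost of a second XPath call per matched element.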
