Question

I'm trying to parse example.com's home page with xpath and cssselect, but either I don't understand how XPath works or lxml's XPath is broken, because it is missing matches.

Here is the quick and dirty code.

from lxml.html import parse

mySearchTree = parse('http://www.example.com').getroot()

# Find all 'a' elements inside 'tr' table rows with cssselect
for a in mySearchTree.cssselect('tr a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))

print('-' * 8 + 'Now for Xpath' + 8 * '-')
# Find all 'a' elements inside 'tr' table rows with xpath
for a in mySearchTree.xpath('.//tr/*/a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))

Result:

found "About" link to href "/about/"
found "Presentations" link to href "/about/presentations/"
found "Performance" link to href "/about/performance/"
found "Reports" link to href "/reports/"
found "Domains" link to href "/domains/"
found "Root Zone" link to href "/domains/root/"
found ".INT" link to href "/domains/int/"
found ".ARPA" link to href "/domains/arpa/"
found "IDN Repository" link to href "/domains/idn-tables/"
found "Protocols" link to href "/protocols/"
found "Number Resources" link to href "/numbers/"
found "Abuse Information" link to href "/abuse/"
found "Internet Corporation for Assigned Names and Numbers" link to href "http://www.icann.org/"
--------Now for Xpath--------
found "Presentations" link to href "/about/presentations/"
found "Performance" link to href "/about/performance/"
found "Reports" link to href "/reports/"
found "Root Zone" link to href "/domains/root/"
found ".INT" link to href "/domains/int/"
found ".ARPA" link to href "/domains/arpa/"
found "IDN Repository" link to href "/domains/idn-tables/"
found "Abuse Information" link to href "/abuse/"
found "Internet Corporation for Assigned Names and Numbers" link to href "http://www.icann.org/"

Basically, xpath found every link it should have, except for the ones that Example.com renders in bold. But shouldn't the asterisk wildcard in the XPath './/tr/*/a' allow for that?

Answer 1

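There may be something else going on (I haven't examined the sample document closely), but your CSS selector and your XPath are not equivalent.

In XPath, the CSS selector tr a is //tr//a. .//tr/*/a means (conceptually, not precisely):

.: the current node
//: all descendants of the current node
tr: all tr elements among the descendants of the current node
/: all children of the tr elements found
*: all element children of the tr elements found
/: all children of all element children of the tr elements found
a: all a elements that are element children of the element children of the tr elements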

In other words, given the following HTML:

<ul>
    <li><a href="link1"></a></li>
    <li><b><a href="link2"></a></b></li>
</ul>

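//ul/*/a matches only link1, because the link2 anchor is a grandchild of its li (it sits inside the b element) rather than a direct child.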
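A quick way to check this yourself is sketched below. It uses the standard library's xml.etree.ElementTree (which supports a limited XPath subset) instead of lxml, purely so it runs without extra packages; the wrapping body element and the variable names are my own assumptions. lxml's xpath() method should accept the same two expressions.

```python
import xml.etree.ElementTree as ET

# The list from the answer, wrapped in a <body> so that ".//ul" has
# a descendant <ul> to find (ElementTree's ".//" skips the root itself).
DOC = """
<body>
<ul>
    <li><a href="link1"></a></li>
    <li><b><a href="link2"></a></b></li>
</ul>
</body>
"""

root = ET.fromstring(DOC)

# "/*/a": exactly one element (any name) between <ul> and <a>.
direct = [a.get('href') for a in root.findall('.//ul/*/a')]

# "//a": any number of elements between <ul> and <a>.
any_depth = [a.get('href') for a in root.findall('.//ul//a')]

print(direct)     # ['link1']
print(any_depth)  # ['link1', 'link2']
```

The second expression is the one that behaves like the CSS descendant selector ul a.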

XPath Primer

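Strictly speaking, an "XPath" is a series of location steps separated by slashes. A location step consists of:

An axis (for example, child::)
A node test (a node name, or one of the special node types such as node() or text())
Optional predicates (enclosed in []. A node matches only if all of its predicates are true.)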
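To make the predicate part concrete, here is a small sketch. It again assumes the standard library's xml.etree.ElementTree rather than lxml, and the table row, attribute values, and variable names are invented for illustration. Every step uses the default child axis, the node test is the element name, and each [...] narrows the match further.

```python
import xml.etree.ElementTree as ET

# Hypothetical table row, made up for this example.
row = ET.fromstring(
    '<tr>'
    '<td><a href="/about/">About</a></td>'
    '<td><a name="top">Top</a></td>'
    '<td><a href="/reports/" class="nav">Reports</a></td>'
    '</tr>'
)

# No predicate: the node test "a" matches every <a> child of a <td>.
all_links = [a.text for a in row.findall('./td/a')]

# One predicate: keep only <a> elements that carry an href attribute.
with_href = [a.text for a in row.findall('./td/a[@href]')]

# Two predicates: a node matches only if both are true.
nav_only = [a.text for a in row.findall("./td/a[@href][@class='nav']")]

print(all_links)  # ['About', 'Top', 'Reports']
print(with_href)  # ['About', 'Reports']
print(nav_only)   # ['Reports']
```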

If we break .//tr/*/a down into its location steps, it looks like this:

.
(the "empty space" between the two slashes in "//")
tr
*
a

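It may not be obvious what I'm talking about. That is because XPath has an abbreviated syntax. Here is the same expression with the abbreviations expanded (the axis and node test are separated by ::, the steps by /):

self::node()/descendant-or-self::node()/child::tr/child::*/child::a

(Note that self::node() is redundant.)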
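The abbreviation rules used here are mechanical enough to sketch in a few lines. The expand function below is a hypothetical toy of my own, not part of lxml or any XPath library, and it only handles ".", "..", "//", "@name", "*" and plain element names; predicates, functions, and explicit axes are out of scope.

```python
def expand(xpath):
    """Expand a small subset of abbreviated XPath into the long form."""
    # "//" is short for "/descendant-or-self::node()/".
    xpath = xpath.replace('//', '/descendant-or-self::node()/')
    steps = []
    for step in xpath.split('/'):
        if step == '.':
            steps.append('self::node()')         # "." is the context node
        elif step == '..':
            steps.append('parent::node()')       # ".." is its parent
        elif step.startswith('@'):
            steps.append('attribute::' + step[1:])
        elif step == '' or '::' in step:
            steps.append(step)                   # leading "/" or already expanded
        else:
            steps.append('child::' + step)       # the default axis is child::
    return '/'.join(steps)


print(expand('.//tr/*/a'))
# self::node()/descendant-or-self::node()/child::tr/child::*/child::a
print(expand('//tr//a'))
# /descendant-or-self::node()/child::tr/descendant-or-self::node()/child::a
```

Comparing the two outputs shows the difference directly: child::* admits exactly one intermediate element, while descendant-or-self::node() admits any number, including none.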

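Conceptually, here is what happens in a step:

Start with a set of context nodes (by default the current node, or the root node for a path that begins with '/')
For each context node, build the set of nodes that satisfy the location step
Union all of the per-context-node sets into a single node set
Pass that set to the next location step as its context nodes
Repeat until you run out of steps. The set left after the last step is the result for the whole path.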
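The procedure above can be sketched as a tiny evaluator. This is only an illustration under several assumptions of mine: it uses the standard library's xml.etree.ElementTree, whose trees contain element nodes only (so node() is approximated by "any element"), and the helper names and the sample table are hypothetical. It is not how lxml evaluates XPath internally.

```python
import xml.etree.ElementTree as ET


def descendant_or_self(node):
    """descendant-or-self::node(), approximated with elements only."""
    return list(node.iter())


def child(name):
    """Build a step function for child::name (or child::* for any name)."""
    return lambda node: [c for c in node if name == '*' or c.tag == name]


def evaluate(context, steps):
    """Apply each location step to every context node and union the results."""
    for step in steps:
        next_context, seen = [], set()
        for node in context:
            for match in step(node):
                if id(match) not in seen:   # union: keep each node once
                    seen.add(id(match))
                    next_context.append(match)
        context = next_context              # feeds the next location step
    return context


table = ET.fromstring("""
<table>
  <tr>
    <td><a href="link1">plain</a></td>
    <td><b><a href="link2">bold</a></b></td>
  </tr>
</table>
""")

# .//tr/*/a  (the redundant self::node() step is left out)
strict = evaluate([table], [descendant_or_self, child('tr'),
                            child('*'), child('a')])
# .//tr//a
loose = evaluate([table], [descendant_or_self, child('tr'),
                           descendant_or_self, child('a')])

print([a.get('href') for a in strict])  # ['link1']
print([a.get('href') for a in loose])   # ['link1', 'link2']
```

Tracing the strict path by hand gives the context sets {every element}, {tr}, {td, td}, {a link1}: the bold link drops out at the last step because its parent is b, which was never in the context set.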

Note that this is still a simplification. Read the XPath standard if you want the details.

Answer 2

'tr a' -> '//tr//a'
