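I'm trying to parse example.com's home page with both xpath and cssselect, but either I don't understand how XPath works or lxml's xpath is broken, because it misses some matches.
Here is the quick and dirty code.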
from lxml.html import parse

# Fetch the page and take the root element of the parsed tree
mySearchTree = parse('http://www.example.com').getroot()

# Find all 'a' elements anywhere inside 'tr' table rows with a CSS selector
for a in mySearchTree.cssselect('tr a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))

print('-' * 8 + 'Now for Xpath' + 8 * '-')

# Find all 'a' elements inside 'tr' table rows with xpath
for a in mySearchTree.xpath('.//tr/*/a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
Result:
found "About" link to href "/about/"
found "Presentations" link to href "/about/presentations/"
found "Performance" link to href "/about/performance/"
found "Reports" link to href "/reports/"
found "Domains" link to href "/domains/"
found "Root Zone" link to href "/domains/root/"
found ".INT" link to href "/domains/int/"
found ".ARPA" link to href "/domains/arpa/"
found "IDN Repository" link to href "/domains/idn-tables/"
found "Protocols" link to href "/protocols/"
found "Number Resources" link to href "/numbers/"
found "Abuse Information" link to href "/abuse/"
found "Internet Corporation for Assigned Names and Numbers" link to href "http://www.icann.org/"
--------Now for Xpath--------
found "Presentations" link to href "/about/presentations/"
found "Performance" link to href "/about/performance/"
found "Reports" link to href "/reports/"
found "Root Zone" link to href "/domains/root/"
found ".INT" link to href "/domains/int/"
found ".ARPA" link to href "/domains/arpa/"
found "IDN Repository" link to href "/domains/idn-tables/"
found "Abuse Information" link to href "/abuse/"
found "Internet Corporation for Assigned Names and Numbers" link to href "http://www.icann.org/"
Basically, XPath finds every link it should, except the ones that Example.com renders in bold. But shouldn't the asterisk wildcard in the XPath './/tr/*/a' allow for that?
Answer 0 (score: 3)
Something else may be going on (I have not examined the sample document closely), but your CSS selector and your XPath are not equivalent.
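The XPath equivalent of the CSS selector tr a is //tr//a. The expression .//tr/*/a means (conceptually, not precisely):
. : the current node
// : all descendants of the current node
tr : all tr elements among the descendants of the current node
/ : all children of the tr elements found
* : all element children of the tr elements found
/ : all children of the element children of the tr elements found
a : all a elements that are children of element children of tr elements
In other words, given the following HTML: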
<ul>
<li><a href="link1"></a></li>
<li><b><a href="link2"></a></b></li>
</ul>
//ul/*/a matches only link1.
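The difference can be checked with a minimal sketch. It uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset, in place of lxml so that it runs without extra packages; the same two paths can also be passed to lxml's xpath(). The <body> wrapper and the names DOC, two_levels and any_depth are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The example list from above, wrapped in a hypothetical <body> root so that
# <ul> is a descendant of the element the search starts from.
DOC = """
<body>
  <ul>
    <li><a href="link1"></a></li>
    <li><b><a href="link2"></a></b></li>
  </ul>
</body>
"""

root = ET.fromstring(DOC)

# ul/*/a: the <a> must sit exactly two levels below <ul>
two_levels = [a.get("href") for a in root.findall(".//ul/*/a")]

# ul//a: the <a> may sit at any depth below <ul>
any_depth = [a.get("href") for a in root.findall(".//ul//a")]

print(two_levels)  # ['link1']
print(any_depth)   # ['link1', 'link2']
```

The link wrapped in <b> is three levels below <ul>, so the single * wildcard, which stands for exactly one element, cannot reach it.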
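In effect, an XPath expression is a series of location steps separated by slashes. A location step consists of:
an axis (e.g. child::, descendant-or-self::)
a node test (either a name or one of the special node tests, e.g. node(), text())
optional predicates (enclosed in []; a node is matched only if all predicates are true)
If we break .//tr/*/a down into its location steps, it looks like this:
.
(the empty step between the two slashes of //)
tr
*
a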
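What I mean may not be obvious. That is because XPath has an abbreviated syntax. Here is the expression with the abbreviations expanded (axis and node test are separated by ::, steps by /):
self::node()/descendant-or-self::node()/child::tr/child::*/child::a
(Note that self::node() is redundant.)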
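Conceptually, this is what happens at each step:
self::node() : start with the current node
descendant-or-self::node() : take the current node and all of its descendants
child::tr : take all tr elements that are children of the nodes from the previous step
child::* : take all element children of those tr elements
child::a : take all a elements that are children of the nodes from the previous step
Note that this is still a simplification. If you need the details, read the XPath standard.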
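To make the step-by-step reading concrete, here is a rough sketch that evaluates the five expanded steps by hand and compares the result with the abbreviated path. It is not how lxml evaluates XPath internally. The table markup and the helper names descendant_or_self and child are assumptions invented for the example, and the standard library's xml.etree.ElementTree stands in for lxml. Only element nodes are tracked, which is a further simplification.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: one plain link and one link wrapped in <b>
DOC = """
<body>
  <table>
    <tr><td><a href="/plain/">Plain</a></td></tr>
    <tr><td><b><a href="/bold/">Bold</a></b></td></tr>
  </table>
</body>
"""


def descendant_or_self(nodes):
    """descendant-or-self::node(), restricted to element nodes."""
    return [d for n in nodes for d in n.iter()]


def child(nodes, name="*"):
    """child::name, where "*" accepts any element."""
    return [c for n in nodes for c in n if name == "*" or c.tag == name]


root = ET.fromstring(DOC)

step1 = [root]                     # self::node()
step2 = descendant_or_self(step1)  # descendant-or-self::node()
step3 = child(step2, "tr")         # child::tr
step4 = child(step3, "*")          # child::*  (the <td> elements here)
step5 = child(step4, "a")          # child::a

by_hand = [a.get("href") for a in step5]
by_path = [a.get("href") for a in root.findall(".//tr/*/a")]

print(by_hand)  # ['/plain/']
print(by_path)  # ['/plain/']
```

In this made-up table the link inside <b> is lost at the last step: it is a child of <b>, not of <td>, so child::a never sees it. Whether example.com's bold links are nested this way is an assumption, but it would explain the missing matches.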
Answer 1 (score: 1)
'tr a' -> '//tr//a'