I extracted the following with the help of XPath:
In [206]: list = tree.xpath('/html/body/div[@id="gs_top"]/div[@id="gs_bdy"]/div[@id="gs_ccl"]/div[@id="gsc_ccl"]/div[@class="gsc_1usr gs_scl"]/div[@class="gsc_1usr_text"]/h3[@class="gsc_1usr_name"]/a')
In [208]: for item in list:
   .....:     print(etree.tostring(item, pretty_print=True))
   .....:
<a href="/citations?user=lMkTx0EAAAAJ&hl=en&oe=ASCII">Jason Weston</a>
<a href="/citations?user=RhFhIIgAAAAJ&hl=en&oe=ASCII">Pierre Baldi</a>
<a href="/citations?user=9DXQi8gAAAAJ&hl=en&oe=ASCII">Yair Weiss</a>
<a href="/citations?user=J8YyZugAAAAJ&hl=en&oe=ASCII">Peter Belhumeur</a>
<a href="/citations?user=ORr4XJYAAAAJ&hl=en&oe=ASCII">Serge Belongie</a>
Now I can extract either the href or the text by appending /@href or text() to the expression. But how can I get both at once, as shown in the answer to: How to select two attributes from the same node with one expression in XPath?
Answer 0 (score: 1)
Just call .xpath("@href|text()") on each element, like this:
for item in list:
    href, text = item.xpath("@href|text()")
    print(href, text)
Demo:
>>> from lxml.html import fromstring
>>>
>>> data = """
... <body>
... <a href="/citations?user=lMkTx0EAAAAJ&hl=en&oe=ASCII">Jason Weston</a>
... <a href="/citations?user=RhFhIIgAAAAJ&hl=en&oe=ASCII">Pierre Baldi</a>
... <a href="/citations?user=9DXQi8gAAAAJ&hl=en&oe=ASCII">Yair Weiss</a>
... <a href="/citations?user=J8YyZugAAAAJ&hl=en&oe=ASCII">Peter Belhumeur</a>
... <a href="/citations?user=ORr4XJYAAAAJ&hl=en&oe=ASCII">Serge Belongie</a>
... </body>
... """
>>>
>>> tree = fromstring(data)
>>>
>>> for item in tree.xpath("//a"):
... print(item.xpath("@href|text()"))
...
['/citations?user=lMkTx0EAAAAJ&hl=en&oe=ASCII', 'Jason Weston']
['/citations?user=RhFhIIgAAAAJ&hl=en&oe=ASCII', 'Pierre Baldi']
['/citations?user=9DXQi8gAAAAJ&hl=en&oe=ASCII', 'Yair Weiss']
['/citations?user=J8YyZugAAAAJ&hl=en&oe=ASCII', 'Peter Belhumeur']
['/citations?user=ORr4XJYAAAAJ&hl=en&oe=ASCII', 'Serge Belongie']
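One caveat: unpacking the result of "@href|text()" into exactly two names only works when every <a> has an href and a single direct text node. A link without an href, or one whose name is wrapped in a child tag, would not yield exactly two items. The sketch below is one possible more defensive variant, not part of the original answer. It uses lxml's .get() and .text_content() instead of the union expression; the helper name extract_links and the third sample link are made up for illustration.

```python
from lxml.html import fromstring

# Sample markup: two links from the question plus one hypothetical
# link that has no href and keeps part of its name in a child tag.
data = """
<body>
<a href="/citations?user=lMkTx0EAAAAJ&hl=en&oe=ASCII">Jason Weston</a>
<a href="/citations?user=RhFhIIgAAAAJ&hl=en&oe=ASCII">Pierre Baldi</a>
<a name="no-href"><b>Nested</b> name</a>
</body>
"""


def extract_links(tree):
    """Return a list of (href, text) tuples, one per <a> element."""
    pairs = []
    for a in tree.xpath("//a"):
        href = a.get("href")             # None if the attribute is missing
        text = a.text_content().strip()  # includes text of nested tags
        pairs.append((href, text))
    return pairs


pairs = extract_links(fromstring(data))
for href, text in pairs:
    print(href, text)
```

Because each value is read separately, a missing href shows up as None rather than raising a ValueError during unpacking.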