Retrieving text inside a span with XPath

Time: 2014-09-11 19:06:59

Tags: python xml xpath

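I am using Python to retrieve various metrics from websites (for example, likes, Twitter shares, and so on). XPath works fine for retrieving text in general, but I am running into problems with these metrics, whose values are text inside a span.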

<span class="pluginCountTextDisconnected">78</span>

I need to get the "78", but when I run the XPath for it in Python, nothing is returned.

Here is the XPath, just in case:

//*[@id="u_0_2"]/span[2]

Python code:

from lxml import html
import urllib2  
from unicsv import CsvUnicodeReader

req=urllib2.Request("http://www.nu.nl/binnenland/3866370/reddingsbrigade-redt-369-mensen-zomer-.html")
tree = html.fromstring(urllib2.urlopen(req).read())
fb_likes = tree.xpath('//*[@id="u_0_2"]/span[2]')
print fb_likes

2 Answers:

Answer 0: (score: 0)

Append /text() to the XPath:

//*[@id="u_0_2"]/span[2]/text()
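
As a minimal sketch of what /text() changes, the snippet below runs both expressions against an inline HTML fragment. The fragment is an assumption made for illustration (only the span with class pluginCountTextDisconnected comes from the question; the wrapping div and the first span are made up), and no network access is involved.

```python
from lxml import html

# Inline stand-in for the relevant part of the page. The wrapping div and the
# first span are assumptions; only the second span appears in the question.
snippet = """
<div id="u_0_2">
  <span>label</span>
  <span class="pluginCountTextDisconnected">78</span>
</div>
"""
tree = html.fromstring(snippet)

# Without /text() the result is a list of element objects.
elements = tree.xpath('//*[@id="u_0_2"]/span[2]')

# With /text() the result is a list of the text nodes inside the span.
texts = tree.xpath('//*[@id="u_0_2"]/span[2]/text()')

print(elements)          # a list holding one element object
print(texts)             # ['78']
print(elements[0].text)  # 78
```

Either form only helps if the span is actually present in the HTML that was parsed; an empty list from both expressions suggests the element is not in that document at all.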

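Answer 1: (score: 0)

Your span is inside an iframe, so you need to get the text from within the iframe's document. (By the way, //span[@class='pluginCountTextDisconnected']/text() is the right expression, but you are running it outside the iframe.) So you need to read the iframe's src, something like: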

a = html.fromstring(urllib2.urlopen("http://www.nu.nl/binnenland/3866370/reddingsbrigade-redt-369-mensen-zomer-.html").read())
# Take the src attribute of the first iframe found in the page
iframe_src = a.xpath("//iframe/@src")[0]
iframe = html.fromstring(urllib2.urlopen(iframe_src).read())
fb_likes = iframe.xpath("//span[@class='pluginCountTextDisconnected']/text()")
Sorry, I have not tested this code; it is only the general idea.
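
To illustrate why the query comes back empty on the outer page, here is a small sketch using a hypothetical outer document. The markup and the example.com URL are assumptions, not the real nu.nl page; the point is only that an iframe's content is a separate document, so the parser sees the iframe tag and its src but none of the spans rendered inside it.

```python
from lxml import html

# Hypothetical outer page: the like count is served by a separate document
# that the browser loads into the iframe, so it is absent from this markup.
outer = html.fromstring("""
<html><body>
  <div>
    <iframe src="http://example.com/like-widget"></iframe>
  </div>
</body></html>
""")

# The span is not part of the outer document, so nothing matches.
span_text = outer.xpath("//span[@class='pluginCountTextDisconnected']/text()")

# The iframe's src attribute is what points at the document holding the span.
iframe_srcs = outer.xpath("//iframe/@src")

print(span_text)    # []
print(iframe_srcs)  # ['http://example.com/like-widget']
```

If a page adds its iframe with JavaScript at runtime, the HTML returned by urllib2 will not contain the iframe tag either, and the list of src values will be empty as well. In that case the iframe URL has to be obtained some other way, for example by copying it from the browser's developer tools.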

Update

import urllib2, lxml.html

iframe_asfile = urllib2.urlopen('http://www.facebook.com/plugins/like.php?action=recommend&app_id=&channel=http%3A%2F%2Fstatic.ak.facebook.com%2Fconnect%2Fxd_arbiter%2FZEbdHPQfV3x.js%3Fversion%3D41%23cb%3Df112fd0c7b19666%26domain%3Dwww.nu.nl%26origin%3Dhttp%253A%252F%252Fwww.nu.nl%252Ff62d30922cee5%26relation%3Dparent.parent&href=http%3A%2F%2Fwww.nu.nl%2Fbinnenland%2F3866370%2Freddingsbrigade-redt-369-mensen-zomer-.html&layout=box_count&locale=nl_NL&sdk=joey&send=false&show_faces=true&width=75')
iframe_data = iframe_asfile.read()
iframe_asfile.close()

iframe_html = lxml.html.document_fromstring(iframe_data)

fb_likes = iframe_html.xpath(".//span[@class='pluginCountTextDisconnected']/text()")
print fb_likes[0]

This prints 78.