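I'm trying to scrape a number from a web page, specifically the current presidential approval rating from RealClearPolitics.
Here is the code I'm using. It fetches the page with urllib2, parses it with lxml, and uses the XPath reported by Chrome. The problem is that I end up with an empty list.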
import urllib2
from lxml import etree
url = "http://www.realclearpolitics.com/epolls/other/president_obama_job_approval-1044.html"
page = urllib2.urlopen(url)
tree = etree.parse(page.content, etree.HTMLParser())
rcp=tree.xpath('//*[@id="polling-data-rcp"]/table/tbody/tr[2]/td[4]')
print rcp
Any help would be greatly appreciated!
Answer 0 (score: 1)
The tr[2]/td[4] step in your XPath is not right.
So you need to use a correct XPath query.
The Python code would be:
import requests
from lxml import html
URL = "http://www.realclearpolitics.com/epolls/other/president_obama_job_approval-1044.html"
# Fetch the page and parse the raw bytes into an lxml element tree
response = requests.get(URL)
tree = html.fromstring(response.content)
# The two figures are the first and second "candidate" cells of the chart legend table
approve_xpath = '//table[@class="chart_legend small_legend"]/tbody/tr/td[@class="candidate"][1]/div[1]/span/text()'
disapprove_xpath = '//table[@class="chart_legend small_legend"]/tbody/tr/td[@class="candidate"][2]/div[1]/span/text()'
# xpath() returns a list of matching text nodes; take the first one and convert it
rcp_approve = float(tree.xpath(approve_xpath)[0])
rcp_disapprove = float(tree.xpath(disapprove_xpath)[0])
print("Obama's approve rate: {}".format(rcp_approve))
print("Obama's disapprove rate: {}".format(rcp_disapprove))
Output:
Obama's approve rate: 44.4
Obama's disapprove rate: 51.6
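
One general reason an XPath copied from a browser can come back empty in lxml is that browsers insert a tbody element into tables while building the DOM, whereas lxml parses the markup as served and does not add one. The sketch below illustrates that on a small inline snippet. The markup is hypothetical and much simpler than the real page, the poll dates in it are made up, and first_float is a helper name introduced only for this example; whether the real page's table has an explicit tbody is not checked here.

```python
from lxml import html

# Hypothetical, simplified markup (NOT the real RealClearPolitics page).
# The table has no <tbody>, as is common in HTML served by a site.
SAMPLE_HTML = """
<html><body>
<div id="polling-data-rcp">
  <table>
    <tr><th>Poll</th><th>Date</th><th>Sample</th><th>Approve</th><th>Disapprove</th></tr>
    <tr><td>RCP Average</td><td>1/1 - 1/7</td><td>--</td><td>44.4</td><td>51.6</td></tr>
  </table>
</div>
</body></html>
"""

# Path as a browser would report it, including the <tbody> step the browser added.
BROWSER_XPATH = '//*[@id="polling-data-rcp"]/table/tbody/tr[2]/td[4]'
# The same path without the <tbody> step, matching the markup above.
RAW_XPATH = '//*[@id="polling-data-rcp"]/table/tr[2]/td[4]'


def first_float(tree, query):
    """Return the first XPath match as a float, or None if nothing matched."""
    matches = tree.xpath(query)
    if not matches:
        return None
    first = matches[0]
    # A query ending in text() yields strings; otherwise we get elements.
    text = first.text_content() if hasattr(first, "text_content") else first
    return float(text.strip())


tree = html.fromstring(SAMPLE_HTML)
print(first_float(tree, BROWSER_XPATH))  # no match in this snippet
print(first_float(tree, RAW_XPATH))      # fourth cell of the second row
```

Returning None instead of indexing [0] directly avoids an IndexError when a query matches nothing, which makes it easier to see which step of a path is at fault.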