I am having trouble selecting HTML elements from a table with Scrapy using XPath. I am working from the basic example on the Scrapy site: http://doc.scrapy.org/en/latest/intro/tutorial.html. The site I want to parse is http://www.euroleague.net/main/results/showgame?gamecode=5&gamenumber=1&phasetypecode=RS&seasoncode=E2013#!playbyplay
At first I used this code:
from basketbase.items import BasketbaseItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse


class Basketspider(CrawlSpider):
    name = "playbyplay"
    download_delay = 0.5
    allowed_domains = ["www.euroleague.net"]
    start_urls = ["http://www.euroleague.net/main/results/showgame?gamenumber=1&phasetypecode=RS&gamecode=4&seasoncode=E2013"]
    rules = (
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),
    )

    def parse(self, response):
        response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
        return super(Basketspider, self).parse(response)

    def parse_item(self, response):
        response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
        sel = HtmlXPathSelector(response)
        items = []
        item = BasketbaseItem()
        item['game_time'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[1]/text()').extract()
        item['game_event'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[2]/text()').extract()
        item['game_event_res_home'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[3]/text()').extract()
        item['game_event_res_visitor'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[3]/text()').extract()
        item['game_event_team'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[4]/text()').extract()
        item['game_event_player'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[5]/text()').extract()
        items.append(item)
        return items
This is a basic version and the rules are not quite right yet, but the main concern in this example is the XPath.
It works, but not the way I want it to. I want each item to hold the value of a single td per tr, but with this code all the td elements are extracted into the item at once. The game_event_res_visitor field, for example:
'game_event_res_visitor': [u'0-0',
u'0-0',
u'0-0',.......(list goes on and on)
To get the result I want, I decided to use a loop (as in the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html), but it returns no values at all. Here is the code:
def parse(self, response):
    response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
    return super(Basketspider, self).parse(response)

def parse_item(self, response):
    response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
    sel = HtmlXPathSelector(response)
    sites = sel.xpath('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr')
    items = []
    item = BasketbaseItem()
    for site in sites:
        item = BasketbaseItem()
        item['game_time'] = sel.select('td[1]/text()').extract()
        item['game_event'] = sel.select('td[2]/text()').extract()
        item['game_event_res_home'] = sel.select('td[3]/text()').extract()
        item['game_event_res_visitor'] = sel.select('td[3]/text()').extract()
        item['game_event_team'] = sel.select('td[4]/text()').extract()
        item['game_event_player'] = sel.select('td[5]/text()').extract()
        items.append(item)
    return items
And the terminal output:
2014-03-07 16:57:45+0200 [playbyplay] DEBUG: Scraped from <200 http://www.euroleague.net/main/results/showgame?gamecode=9&gamenumber=1&phasetypecode=RS&seasoncode=E2013>
{'game_event': [],
'game_event_player': [],
'game_event_res_home': [],
'game_event_res_visitor': [],
'game_event_team': [],
'game_time': []}
2014-03-07 16:57:45+0200 [playbyplay] DEBUG: Scraped from <200 http://www.euroleague.net/main/results/showgame?gamecode=9&gamenumber=1&phasetypecode=RS&seasoncode=E2013>
{'game_event': [],
'game_event_player': [],
'game_event_res_home': [],
'game_event_res_visitor': [],
'game_event_team': [],
'game_time': []}
I understand that something is wrong with my XPath, but I don't see what. If I use a relative XPath in the item fields, it gives the same result as in the first example. So the data is there, but I can't reach it with the code I have. I even tried wildcards:
item['game_time'] = sel.select('*/text()').extract()
item['game_event'] = sel.select('*/text()').extract()
item['game_event_res_home'] = sel.select('*/text()').extract()
item['game_event_res_visitor'] = sel.select('*/text()').extract()
item['game_event_team'] = sel.select('*/text()').extract()
item['game_event_player'] = sel.select('*/text()').extract()
That also failed to return any real text:
2014-03-07 19:11:14+0200 [playbyplay] DEBUG: Scraped from <200 http://www.euroleague.net/main/results/showgame?gamecode=7&gamenumber=1&phasetypecode=RS&seasoncode=E2013>
{'game_event': [u' \r\n', u'\r\n'],
'game_event_player': [u' \r\n', u'\r\n'],
'game_event_res_home': [u' \r\n', u'\r\n'],
'game_event_res_visitor': [u' \r\n', u'\r\n'],
'game_event_team': [u' \r\n', u'\r\n'],
'game_time': [u' \r\n', u'\r\n']}
I'm confused, and I can't see what is wrong with my XPath or my code.
Answer (score: 3):
This works for me:
def parse_item(self, response):
    response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
    sel = HtmlXPathSelector(response)
    rows = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr')
    for row in rows:
        item = BasketbaseItem()
        item['game_time'] = row.select("td[1]/text()").extract()[0]
        item['game_event'] = row.select("td[2]/text()").extract()[0]
        result = row.select("td[3]/text()").extract()[0]
        item['game_event_res_home'], item['game_event_res_visitor'] = result.split('-')
        item['game_event_team'] = row.select("td[4]/text()").extract()[0]
        item['game_event_player'] = row.select("td[5]/text()").extract()[0]
        yield item
Here is a sample item I got:
{'game_event': u'Steal',
'game_event_player': u'DJEDOVIC, NIHAD',
'game_event_res_home': u'0 ',
'game_event_res_visitor': u' 0',
'game_event_team': u'FC Bayern Munich',
'game_time': u'2'}
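Note that the split score halves keep stray whitespace (u'0 ' and u' 0' above). If that matters, one small tweak (my addition, not part of the original snippet) is to strip each half after the split:

home, visitor = [part.strip() for part in result.split('-')]
item['game_event_res_home'], item['game_event_res_visitor'] = home, visitor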
This is only a starting point for you: sometimes an item is not yielded because of an IndexError exception, so handle that properly.
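For instance, here is a minimal sketch of one way to handle it, assuming rows that lack the expected cells (empty or header rows) can simply be skipped; it reuses the same row loop as above:

for row in rows:
    try:
        item = BasketbaseItem()
        item['game_time'] = row.select("td[1]/text()").extract()[0]
        item['game_event'] = row.select("td[2]/text()").extract()[0]
        result = row.select("td[3]/text()").extract()[0]
        item['game_event_res_home'], item['game_event_res_visitor'] = result.split('-')
        item['game_event_team'] = row.select("td[4]/text()").extract()[0]
        item['game_event_player'] = row.select("td[5]/text()").extract()[0]
    except IndexError:
        # The row did not have the expected cells, so skip it instead of crashing.
        continue
    yield item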
Hope that helps.