I'm using Scrapy to build a simple web scraper that pulls football team results from the BBC website. The relevant HTML from the page (http://www.bbc.com/sport/football/teams/bolton-wanderers/results) is:
<tr class="report" id="match-row-EFBO755964">
<td class="statistics show" title="Show latest match stats">
<button>Show</button>
</td>
<td class="match-competition"> Championship </td>
<td class="match-details teams">
<p>
<span class="team-home teams"> <a href="/sport/football/teams/huddersfield-town">Huddersfield</a> </span>
<span class="score"> <abbr title="Score"> 2-1 </abbr> </span>
<span class="team-away teams"> <a href="/sport/football/teams/bolton-wanderers">Bolton</a> </span>
</p>
</td>
<td class="match-date"> Sun 28 Dec </td>
<td class="time"> Full time </td>
<td class="status"> <a class="report" href="/sport/football/30566395">Report</a>
</td>
</tr>
When I try the scrape in the Scrapy shell, this is the output:
$ scrapy shell http://www.bbc.com/sport/football/teams/bolton-wanderers/results
>>> response.selector.xpath('//tr[@class="report"]/td[@class="match-date"]/text()').extract()
[u' Sun 28 Dec ', u' Fri 26 Dec ', u' Fri 19 Dec ', u' Sat 13 Dec ',...]
However, when I use the same XPath in my spider, I can't get the dates. Here is the item:
import scrapy

class resultsItem(scrapy.Item):
    date = scrapy.Field()
    homeTeam = scrapy.Field()
    score = scrapy.Field()
    awayTeam = scrapy.Field()
Here is the spider:
class resultsSpider(scrapy.Spider):
    name = "results"
    allowed_domains = ["bbc.com"]
    start_urls = ["http://www.bbc.com/sport/football/teams/bolton-wanderers/results"]

    def parse(self, response):
        for sel in response.xpath('//tr[@class="report"]'):
            game = resultsItem()
            game['homeTeam'] = sel.xpath('td[@class="match-details teams"]/p/span[@class="team-home teams"]/a/text()').extract()
            game['score'] = sel.xpath('td[@class="match-details teams"]/p/span[@class="score"]/abbr/text()').extract()
            game['awayTeam'] = sel.xpath('td[@class="match-details teams"]/p/span[@class="team-away teams"]/a/text()').extract()
            game['date'] = response.xpath('td[@class="match-date"]/text()').extract()
            yield game
And finally, the JSON output:
[{"date": [], "awayTeam": ["Bolton"], "homeTeam": ["Huddersfield"], "score": [" 2-1 "]},
{"date": [], "awayTeam": ["Blackburn"], "homeTeam": ["Bolton"], "score": [" 2-1 "]},...
Why can't I get the dates in the spider, even though the same XPath returns them in the shell?
Answer 0 (score: 2)
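Shouldn't it be
game['date'] = sel.xpath('td[@class="match-date"]/text()').extract()
instead of
game['date'] = response.xpath('td[@class="match-date"]/text()').extract()
since you are inside this loop:
for sel in response.xpath('//tr[@class="report"]'):
The path td[@class="match-date"] is relative, so it has to be evaluated against the current row (sel). Evaluated against response, it looks for a td directly under the document root, which matches nothing, hence the empty list. The shell command worked because it used the full path starting with //tr[@class="report"].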
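To see the difference between the two calls without running a crawl, here is a minimal sketch that swaps Scrapy for the standard library's xml.etree.ElementTree, which supports a small XPath subset and treats relative paths the same way. The HTML fragment is a trimmed, well-formed stand-in for the BBC table, and the helper names dates_from_root and dates_from_row are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the results table (assumption: two rows are enough
# to show the effect; the real page has many more columns and rows).
HTML = """
<table>
  <tr class="report">
    <td class="match-competition"> Championship </td>
    <td class="match-date"> Sun 28 Dec </td>
  </tr>
  <tr class="report">
    <td class="match-competition"> Championship </td>
    <td class="match-date"> Fri 26 Dec </td>
  </tr>
</table>
""".strip()

root = ET.fromstring(HTML)


def dates_from_root(root):
    """Mirror the buggy spider: the relative path is evaluated from the root."""
    results = []
    for _row in root.findall('.//tr[@class="report"]'):
        # 'td[...]' only looks at direct children of the root, so it finds nothing.
        cells = root.findall('td[@class="match-date"]')
        results.append([td.text.strip() for td in cells])
    return results


def dates_from_row(root):
    """Mirror the fix: the relative path is evaluated from the current row."""
    results = []
    for row in root.findall('.//tr[@class="report"]'):
        cells = row.findall('td[@class="match-date"]')
        results.append([td.text.strip() for td in cells])
    return results


if __name__ == "__main__":
    print(dates_from_root(root))
    print(dates_from_row(root))
```

In the spider itself the equivalent change is the one suggested above: call sel.xpath(...) inside the loop so that each date is looked up within its own row.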