Scrapy: xpath retrieves data in the shell but not in the item

Time: 2014-12-30 14:40:20

Tags: shell xpath web-scraping scrapy web-crawler

I'm using Scrapy to build a simple web scraper that collects football team results from the BBC website. The relevant HTML from the page (http://www.bbc.com/sport/football/teams/bolton-wanderers/results) is:

<tr class="report" id="match-row-EFBO755964">
  <td class="statistics show" title="Show latest match stats">
    <button>Show</button> 
  </td> 
  <td class="match-competition"> Championship  </td>  
  <td class="match-details teams"> 
    <p> 
      <span class="team-home teams"> <a href="/sport/football/teams/huddersfield-town">Huddersfield</a> </span>   
      <span class="score"> <abbr title="Score"> 2-1 </abbr> </span>   
      <span class="team-away teams"> <a href="/sport/football/teams/bolton-wanderers">Bolton</a> </span>   
    </p> 
  </td> 
  <td class="match-date"> Sun 28 Dec </td>   
  <td class="time">  Full time  </td>   
  <td class="status">   <a class="report" href="/sport/football/30566395">Report</a>
  </td> 
</tr>

When I try the extraction in the scrapy shell, this is the output:

$ scrapy shell http://www.bbc.com/sport/football/teams/bolton-wanderers/results

>>> response.selector.xpath('//tr[@class="report"]/td[@class="match-date"]/text()').extract()
[u' Sun 28 Dec ', u' Fri 26 Dec ', u' Fri 19 Dec ', u' Sat 13 Dec ',...]

However, when I use the same xpath in my spider, I can't get the dates. Here is the item:

class resultsItem(scrapy.Item):
  date          = scrapy.Field()
  homeTeam      = scrapy.Field()
  score         = scrapy.Field()
  awayTeam      = scrapy.Field()

Here is the spider:

class resultsSpider(scrapy.Spider):
    name = "results"
    allowed_domains = ["bbc.com"]
    start_urls = ["http://www.bbc.com/sport/football/teams/bolton-wanderers/results"]

    def parse(self, response):
        for sel in response.xpath('//tr[@class="report"]'):
            game = resultsItem()
            game['homeTeam'] = sel.xpath('td[@class="match-details teams"]/p/span[@class="team-home teams"]/a/text()').extract()
            game['score'] = sel.xpath('td[@class="match-details teams"]/p/span[@class="score"]/abbr/text()').extract()
            game['awayTeam'] = sel.xpath('td[@class="match-details teams"]/p/span[@class="team-away teams"]/a/text()').extract()
            game['date'] = response.xpath('td[@class="match-date"]/text()').extract()

            yield game

And finally, the JSON output:

[{"date": [], "awayTeam": ["Bolton"], "homeTeam": ["Huddersfield"], "score": [" 2-1 "]},
{"date": [], "awayTeam": ["Blackburn"], "homeTeam": ["Bolton"], "score": [" 2-1 "]},...

Why can't I get the date in the item, even though the same xpath works in the shell?

1 answer:

Answer 0: (score: 2)

Shouldn't it be

game['date'] = sel.xpath('td[@class="match-date"]/text()').extract()

instead of

game['date'] = response.xpath('td[@class="match-date"]/text()').extract()

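because you are inside this loop, so the relative path td[...] has to be evaluated against each matched row (sel) rather than against the whole response, where no td sits directly under the top-level node: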

for sel in response.xpath('//tr[@class="report"]'):
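
To see the difference without running a crawl, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors. That swap is an assumption made for illustration: ElementTree supports only a subset of XPath and is not what Scrapy uses internally, but a relative path behaves analogously in both, matching only children of the node it is called on. The sample markup and the helper names dates_from_root and dates_per_row are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, hypothetical stand-in for the BBC results table.
SAMPLE = """
<table>
  <tr class="report">
    <td class="match-competition"> Championship </td>
    <td class="match-date"> Sun 28 Dec </td>
  </tr>
  <tr class="report">
    <td class="match-competition"> Championship </td>
    <td class="match-date"> Fri 26 Dec </td>
  </tr>
</table>
""".strip()

root = ET.fromstring(SAMPLE)


def dates_from_root(root):
    # Relative path evaluated against the top-level node: it only looks
    # for <td> elements that are direct children of <table>, and there
    # are none. This mirrors calling response.xpath('td[...]').
    return [td.text.strip() for td in root.findall("td[@class='match-date']")]


def dates_per_row(root):
    # Same relative path, but evaluated against each <tr> in the loop.
    # This mirrors calling sel.xpath('td[...]') inside the for loop.
    results = []
    for row in root.findall(".//tr[@class='report']"):
        results.append(
            [td.text.strip() for td in row.findall("td[@class='match-date']")]
        )
    return results


print(dates_from_root(root))  # empty list, like "date": [] in the JSON output
print(dates_per_row(root))    # one date per row
```

The shell command in the question worked because it used an absolute path starting with //tr, which searches the whole document; the spider's path starts at td, so it only matches when called on a row.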