Scrapy: extracting the same data from multiple pages

Date: 2016-08-08 03:49:32

Tags: python scrapy

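This is related to a previous question I asked here. I am trying to extract the same data from multiple pages on the same domain. A short explanation: I am trying to pull data such as offensive yards and turnovers from a bunch of different box scores linked from the main page. Extracting the data from a single page works fine, as does generating the URLs, but when I try to have the spider loop through all the pages, nothing is returned. I have looked through many other questions people have asked, as well as the documentation, and I cannot figure out what is not working. The code is below. Thanks in advance to anyone who can help.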

import scrapy

from scrapy import Selector
from nflscraper.items import NflscraperItem

class NFLScraperSpider(scrapy.Spider):
    name = "pfr"
    allowed_domains = ['www.pro-football-reference.com/']
    start_urls = [
        "http://www.pro-football-reference.com/years/2015/games.htm"
        #"http://www.pro-football-reference.com/boxscores/201510110tam.htm"
    ]

    def parse(self,response):
        for href in response.xpath('//a[contains(text(),"boxscore")]/@href'):
            item = NflscraperItem()
            url = response.urljoin(href.extract())
            request = scrapy.Request(url, callback=self.parse_dir_contents)
            request.meta['item'] = item
            yield request

    def parse_dir_contents(self,response):
        item = response.meta['item']
        # Code to pull out JS comment - https://stackoverflow.com/questions/38781357/pro-football-reference-team-stats-xpath/38781659#38781659
        extracted_text = response.xpath('//div[@id="all_team_stats"]//comment()').extract()[0]
        new_selector = Selector(text=extracted_text[4:-3].strip())
        # Item population
        item['home_score'] = response.xpath('//*[@id="content"]/table/tbody/tr[2]/td[last()]/text()').extract()[0].strip()
        item['away_score'] = response.xpath('//*[@id="content"]/table/tbody/tr[1]/td[last()]/text()').extract()[0].strip()
        item['home_oyds'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[6]/td[2]/text()').extract()[0].strip()
        item['away_oyds'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[6]/td[1]/text()').extract()[0].strip()
        item['home_dyds'] = item['away_oyds']
        item['away_dyds'] = item['home_oyds']
        item['home_turn'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[8]/td[2]/text()').extract()[0].strip()
        item['away_turn'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[8]/td[1]/text()').extract()[0].strip()
        yield item

1 answer:

Answer 0 (score: 1)

Your follow-up requests are being filtered out as offsite. Fix your allowed_domains setting:

allowed_domains = ['pro-football-reference.com'] 

That works for me.
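
To illustrate why the original value fails, here is a minimal sketch of a host-based offsite check. It only approximates the idea behind Scrapy's offsite filtering and is not Scrapy's actual code; the helper name `is_allowed` is hypothetical. The point is that the comparison is made against the URL's host name, which never contains a trailing slash, so an entry such as 'www.pro-football-reference.com/' can never match. As far as I know, the start URLs are requested with filtering disabled, which would explain why the first page was still fetched while every box score request was dropped.

```python
import re
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host equals one of the
    allowed domains or is a subdomain of one of them."""
    host = urlparse(url).hostname or ""
    # Build one pattern: optional subdomain prefix, then an allowed domain.
    domains = "|".join(re.escape(d) for d in allowed_domains)
    pattern = r"^(.*\.)?(%s)$" % domains
    return re.match(pattern, host) is not None


url = "http://www.pro-football-reference.com/boxscores/201510110tam.htm"

# Entry from the question: the trailing slash is not part of any host name.
print(is_allowed(url, ["www.pro-football-reference.com/"]))  # False

# Entry from the answer: a bare domain, which also covers the www subdomain.
print(is_allowed(url, ["pro-football-reference.com"]))  # True
```

Running the spider with log output enabled should also show the dropped requests reported as offsite, which is a quick way to confirm this diagnosis.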