Question

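I'm trying to use Scrapy to follow the links to previous years, starting from the URL "https://umanity.jp/en/racedata/race_6.php". On this page the current year is 2018 and there is a "previous" button. Clicking it takes you to 2017, then 2016, and so on back to 2000. But the spider I wrote stops at 2017. My code: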

import scrapy


class RaceSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['umanity.jp']
    start_urls = ['https://umanity.jp/en/racedata/race_6.php']  # start to scrape from this url

    def parse(self, response):
        previous_year_btn = response.xpath('//div[@class="newslist_year_select m_bottom5"]/*[1]')
        if previous_year_btn.extract_first()[1] == 'a':
            href = previous_year_btn.xpath('./@href').extract_first()
            follow_link = response.urljoin(href)
            yield scrapy.Request(follow_link, self.parse_years)

    def parse_years(self, response):
        print(response.url)  # prints only year 2017

I can't figure out why it stops at 2017 and doesn't continue to the earlier years. What is wrong?

Answer 1

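You need to send the request to self.parse, not self.parse_years, to get the result: parse is the only callback that looks for the previous-year link, so it has to handle every page. I also removed the hard-coded index from your XPath to make it less brittle. Try the following: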

import scrapy


class RaceSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['umanity.jp']
    start_urls = ['https://umanity.jp/en/racedata/race_6.php']  # start to scrape from this url

    def parse(self, response):
        print(response.url)  # printed for every year, including the oldest one
        previous_year_btn = response.xpath('//div[contains(@class,"newslist_year_select")]/a')
        # default='' avoids a TypeError when no <img> is found
        if 'race_prev.gif' in previous_year_btn.xpath('.//img/@src').extract_first(default=''):
            href = previous_year_btn.xpath('./@href').extract_first()
            # use parse itself as the callback so each page is searched for the next link
            yield scrapy.Request(response.urljoin(href), self.parse)

However, if you want to keep the second method (parse_years) in use:

def parse(self, response):
    # Re-request the current page for parse_years. This URL has already been
    # fetched once, so dont_filter=True keeps the duplicate filter from
    # dropping the request.
    yield scrapy.Request(response.url, self.parse_years, dont_filter=True)  # this is the fix

    previous_year_btn = response.xpath('//div[contains(@class,"newslist_year_select")]/a')
    if 'race_prev.gif' in previous_year_btn.xpath('.//img/@src').extract_first(default=''):
        href = previous_year_btn.xpath('./@href').extract_first()
        yield scrapy.Request(response.urljoin(href), self.parse)

def parse_years(self, response):
    print(response.url)
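
One thing to watch with this variant: the page at response.url has already been requested once, and as far as I know Scrapy's default duplicate filter identifies a request by its URL, method and body, not by its callback. A repeat request for the same URL can therefore be dropped unless it is sent with dont_filter=True. The sketch below is a toy model of that behaviour in plain Python, not Scrapy's actual scheduler; ToyScheduler and the "?year=2017" URL are made up for illustration.

```python
class ToyScheduler:
    """Simplified stand-in for a scheduler with a URL-based duplicate filter."""

    def __init__(self):
        self.seen = set()   # URLs recorded by the duplicate filter
        self.queue = []     # (url, callback_name) pairs accepted for download

    def enqueue(self, url, callback_name, dont_filter=False):
        # The callback plays no part in the identity check; only the URL does.
        if not dont_filter:
            if url in self.seen:
                return False  # dropped as a duplicate
            self.seen.add(url)
        self.queue.append((url, callback_name))
        return True


# Hypothetical URL, used only to have something to enqueue.
url_2017 = "https://umanity.jp/en/racedata/race_6.php?year=2017"

scheduler = ToyScheduler()
first = scheduler.enqueue(url_2017, "parse")
repeat = scheduler.enqueue(url_2017, "parse_years")
forced = scheduler.enqueue(url_2017, "parse_years", dont_filter=True)
print(first, repeat, forced)  # True False True
```

Note that re-requesting the page downloads it a second time; calling parse_years on the response you already have avoids that extra request.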

Answer 2

The problem is that the parse_years function doesn't look for any further links.

Change:
yield scrapy.Request(follow_link, self.parse_years)
to:
yield scrapy.Request(follow_link, self.parse)
and all the years are found, because the parse function keeps looking for links on every page it receives.
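
To see why the callback matters, here is a toy model of the crawl in plain Python (no Scrapy involved). Each "page" is just a year whose previous link points to the year before it. PREV_LINK, crawl and parse_recursive are names I made up for this sketch, and the 2000 to 2018 range is taken from the question.

```python
# Toy model: each "page" is a year, and its "previous" link points to the
# year before it (None for the oldest year).
FIRST_YEAR, LAST_YEAR = 2000, 2018
PREV_LINK = {year: (year - 1 if year > FIRST_YEAR else None)
             for year in range(FIRST_YEAR, LAST_YEAR + 1)}


def parse(year):
    """Original parse: finds the previous link, hands it to parse_years."""
    prev = PREV_LINK[year]
    if prev is not None:
        yield (prev, parse_years)


def parse_years(year):
    """Original callback: handles the page but yields no further requests."""
    return iter(())


def parse_recursive(year):
    """Fixed parse: requests the previous year with itself as the callback."""
    prev = PREV_LINK[year]
    if prev is not None:
        yield (prev, parse_recursive)


def crawl(start_year, callback):
    """Run callbacks until no requests are left; return the years visited."""
    visited = []
    pending = [(start_year, callback)]
    while pending:
        year, cb = pending.pop(0)
        visited.append(year)
        pending.extend(cb(year))
    return visited


original = crawl(LAST_YEAR, parse)
fixed = crawl(LAST_YEAR, parse_recursive)
print(original)                         # [2018, 2017]
print(fixed[0], fixed[-1], len(fixed))  # 2018 2000 19
```

The original chain ends after one hop because parse_years never yields a new request, while the recursive version keeps going until a page has no previous link.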

If you really want two separate functions (for example, parse_years does some processing of the data while parse finds the next link), that can be done too.

parse_years just needs this:

def parse_years(self, response):
    print(response.url)  # now printed for every earlier year, not just 2017
    yield from self.parse(response)  # let parse look for the previous-year link
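
The yield from is what makes this work. A small plain-Python sketch of the difference, with stand-in functions and strings in place of real Scrapy responses and requests (all names here are illustrative):

```python
def parse(response):
    """Stand-in for the link-finding callback: yields a follow-up request."""
    yield f"Request(previous year of {response})"


def parse_years_without_delegation(response):
    """Calls parse() but never iterates it, so its requests are lost."""
    yield f"item from {response}"
    parse(response)  # creates a generator object and discards it


def parse_years_with_delegation(response):
    """Re-yields everything parse() produces."""
    yield f"item from {response}"
    yield from parse(response)


lost = list(parse_years_without_delegation("2017"))
kept = list(parse_years_with_delegation("2017"))
print(lost)  # ['item from 2017']
print(kept)  # ['item from 2017', 'Request(previous year of 2017)']
```

Simply calling self.parse(response) inside parse_years would not be enough, since parse is a generator and nothing would consume the requests it produces.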

Scrapy: following previous links
