Question

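I am trying to scrape a web page of ads. The ad thumbnails are displayed on the first page of a paginated listing. Clicking a thumbnail shows the details of that ad, including its posting date. For now I only want to scrape ads posted within the last day.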

My Scrapy spider has the following structure:

# open the homepage
def start_requests(self):
    url = 'url_to_page'
    yield scrapy.Request(url=url, callback=self.parse)

# parse the page for ad links and follow each of them
def parse(self, response):
    # get all links from the current page; not shown here
    for link in ad_links:
        yield scrapy.Request(link, callback=self.parse_single_ad)

    # follow the next page, only if today's date > posting date <---

def parse_single_ad(self, response):
    # get the posting date; not shown here
    return item

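The problem is that I can access the posting date only in parse_single_ad(), but I have to stop the pagination in parse() based on the ad's posting date. Is there a way to access, from parse(), the item retrieved in parse_single_ad()? More generally, can I access a callback's data from its parent function?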

Answer 1

You can raise CloseSpider whenever you want to close the spider manually.

You can do this in your Spider class if you like, or even in a Pipeline.

from scrapy.exceptions import CloseSpider

def parse(self, response):
    if some_condition:  # write your condition here
        raise CloseSpider('All ads scraped, now closing spider.')
    # otherwise, scrape the next page

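Edit

OP says he cannot access an ad's posting date until the ad's detail page has been scraped.

But look at this: the posting dates are shown on the listing page as well.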
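If the posting dates really are available on the listing page, the stop condition can be evaluated directly in parse(), with no need to wait for parse_single_ad(). The following is only a minimal sketch of that check, not the site's actual logic. The helper names `parse_posting_date` and `should_follow_next_page`, the date format `%Y-%m-%d`, and the assumption that the listing is sorted newest first are all hypothetical and would need to be adapted to the real page.

```python
from datetime import date, datetime, timedelta

# Assumed format of the date text on the listing page (hypothetical).
DATE_FORMAT = "%Y-%m-%d"


def parse_posting_date(text, fmt=DATE_FORMAT):
    """Convert the date text extracted from a listing entry into a date."""
    return datetime.strptime(text.strip(), fmt).date()


def should_follow_next_page(posting_dates, today, max_age_days=1):
    """Return True only if every ad on the page is recent enough.

    Assumes the listing is sorted newest first, so one ad older than
    the cutoff means all later pages are older as well.
    """
    cutoff = today - timedelta(days=max_age_days)
    # An empty page gives no reason to keep paginating.
    return bool(posting_dates) and all(d >= cutoff for d in posting_dates)


if __name__ == "__main__":
    today = date(2024, 5, 10)
    page = [parse_posting_date(t) for t in ["2024-05-10", " 2024-05-09 "]]
    print(should_follow_next_page(page, today))
```

Inside parse(), the dates extracted from the listing entries would be passed to `should_follow_next_page` together with `date.today()`. When it returns False, the spider can either skip yielding the request for the next page or raise CloseSpider as shown above.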
