Scrapy scrapes one page 'n' times, but the other only once inside a loop

Time: 2016-04-22 11:39:49

Tags: python web-scraping scrapy scrapy-spider scraper

I am iterating over a list of ids and scraping two pages for each id. The first request runs for every id, but the second one is only issued for a single id.

import scrapy
from scrapy import Request


class MySpider(scrapy.Spider):
  name = "scraper"
  allowed_domains = ["example.com"]
  start_urls = ['http://example.com/viewData']

  def parse(self, response):
    ids = ['1', '2', '3']

    for id in ids:
      # The following request is issued for every id
      yield scrapy.FormRequest.from_response(response,
                                             ...
                                             callback=self.parse1)

      # The following request only fires for the first id
      yield Request(url="http://example.com/viewSomeOtherData",
                    callback=self.intermediateMethod)

  def parse1(self, response):
    # Data scraped here using selectors
    pass

  def intermediateMethod(self, response):
    yield scrapy.FormRequest.from_response(response,
                                           ...
                                           callback=self.parse2)

  def parse2(self, response):
    # Some other data scraped here
    pass

I want to scrape two different pages for every single id.

1 answer:

Answer 0 (score: 0)

Change the following lines:

yield Request(url="http://example.com/viewSomeOtherData",
              callback=self.intermediateMethod)

to:

yield Request(url="http://example.com/viewSomeOtherData",
              callback=self.intermediateMethod,
              dont_filter=True)

This worked for me.

Scrapy has a duplicate URL filter, and it is most likely filtering out your requests: every iteration of the loop yields a Request with exactly the same URL, so only the first one is scheduled. Try adding dont_filter=True to that request in the callback, as Steve suggested.
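A minimal sketch of that behaviour, assuming the same two-request pattern as in the question (the spider name, the formdata field "id", and the callback names are placeholders, not taken from the original code): the FormRequests differ in their form data, so the default RFPDupeFilter fingerprints them as distinct requests, while the plain Requests are byte-for-byte identical and get deduplicated unless dont_filter=True is set.

import scrapy
from scrapy import Request


class DedupDemoSpider(scrapy.Spider):
    name = "dedup_demo"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/viewData"]

    def parse(self, response):
        for id in ["1", "2", "3"]:
            # Each FormRequest carries different form data, so the default
            # RFPDupeFilter computes a different fingerprint for each and
            # all three are scheduled.
            yield scrapy.FormRequest.from_response(
                response,
                formdata={"id": id},  # assumed form field name
                callback=self.parse_first_page,
            )
            # These three requests share the same URL, method and body, so
            # the dupefilter drops the second and third copy. dont_filter=True
            # tells the scheduler to keep every one of them.
            yield Request(
                url="http://example.com/viewSomeOtherData",
                callback=self.parse_second_page,
                dont_filter=True,
            )

    def parse_first_page(self, response):
        pass  # selectors for the first page go here

    def parse_second_page(self, response):
        pass  # selectors for the second page go here

Note that dont_filter=True bypasses deduplication only for that specific request; all other requests in the spider are still filtered as usual.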