I'm iterating over a list of IDs and scraping two pages for each one. The first request scrapes for all IDs, but the second works for only one ID.
import scrapy
from scrapy import Request

class MySpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/viewData']

    def parse(self, response):
        ids = ['1', '2', '3']
        for id in ids:
            # The following request scrapes for all IDs
            yield scrapy.FormRequest.from_response(response,
                ...
                callback=self.parse1)
            # The following request scrapes only for the 1st ID
            yield Request(url="http://example.com/viewSomeOtherData",
                          callback=self.intermediateMethod)

    def parse1(self, response):
        # Data scraped here using selectors
        ...

    def intermediateMethod(self, response):
        yield scrapy.FormRequest.from_response(response,
            ...
            callback=self.parse2)

    def parse2(self, response):
        # Some other data scraped here
        ...
I want to scrape two different pages for each single ID.
Answer 0 (score: 0)
Change the following lines:
yield Request(url="http://example.com/viewSomeOtherData",
              callback=self.intermediateMethod)
to:
yield Request(url="http://example.com/viewSomeOtherData",
              callback=self.intermediateMethod,
              dont_filter=True)
This worked for me.
Scrapy has a duplicate-URL filter, which is probably filtering out your requests. Try adding dont_filter=True to the request, as Steve suggested.
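To see why only the first ID's request ran: the loop yields the same URL once per ID, and Scrapy's scheduler drops requests whose fingerprint it has already seen unless dont_filter=True is set. A minimal sketch of that idea in plain Python (a deliberate simplification, not Scrapy's actual RFPDupeFilter implementation, which fingerprints more than the URL):

```python
class DupeFilterSketch:
    """Toy model of a fingerprint-based duplicate filter."""

    def __init__(self):
        self.seen = set()  # fingerprints of requests already scheduled

    def should_drop(self, url, dont_filter=False):
        """Return True if the request would be silently discarded."""
        if dont_filter:
            return False          # bypass the filter entirely
        if url in self.seen:
            return True           # duplicate: dropped by the scheduler
        self.seen.add(url)
        return False

f = DupeFilterSketch()
urls = ["http://example.com/viewSomeOtherData"] * 3  # one request per ID

# Without dont_filter, only the first request survives:
print([f.should_drop(u) for u in urls])                    # [False, True, True]

# With dont_filter=True, every request goes through:
print([f.should_drop(u, dont_filter=True) for u in urls])  # [False, False, False]
```

This is why the fix in the answer works: dont_filter=True tells the scheduler to skip the fingerprint check for that one request, so each iteration of the loop actually reaches intermediateMethod.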