Trying to wrap my head around this... I have a fixed list of 100,000 URLs that I want to scrape, which is fine, I know how to handle that part. But first I need to get a cookie from an initial form post and use it for the subsequent requests. Would that be something like a nested spider? Just trying to understand the architecture for this use case.
Thanks!
Answer 0 (score: 1)
Scrapy handles cookies automatically.
All you need to do is post the login form first, and then yield the requests for your 100,000 URLs.
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = (
        'https://example.com/login',  # login page
    )

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.url_list = []  # your list of 100,000 URLs

    def parse(self, response):
        # Put your login credentials here; from_response() fills in the
        # remaining form fields from the login page automatically.
        data = {}
        return scrapy.FormRequest.from_response(
            response,
            formdata=data,
            callback=self.my_start_requests
        )

    def my_start_requests(self, response):
        # The login response itself can be ignored; this callback only
        # schedules the real requests.
        for url in self.url_list:
            # Scrapy's cookie middleware attaches the session cookies
            # to each of these requests automatically.
            yield scrapy.Request(url, callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        # your scraping code here
        pass
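For completeness, here is a minimal sketch of how the two placeholders above (the empty formdata dict and url_list) might be filled in. The file name urls.txt, the form field names, and the failure-marker string are all assumptions for illustration; substitute whatever your site actually uses.

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ('https://example.com/login',)

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Assumed: the fixed list of 100,000 URLs lives in urls.txt,
        # one URL per line.
        with open('urls.txt') as f:
            self.url_list = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        # Assumed field names; inspect the login form's <input> elements
        # for the real ones.
        data = {'username': 'me@example.com', 'password': 'secret'}
        return scrapy.FormRequest.from_response(
            response,
            formdata=data,
            callback=self.my_start_requests,
        )

    def my_start_requests(self, response):
        # Optional sanity check before scheduling 100,000 requests; the
        # marker string is an assumption about the site's error page.
        if b'Invalid credentials' in response.body:
            self.logger.error('Login failed, not scheduling URL list')
            return
        for url in self.url_list:
            yield scrapy.Request(url, callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        pass

No extra configuration is needed for the cookie handling itself: Scrapy's cookies middleware is enabled by default (COOKIES_ENABLED = True), so the session cookie set by the login response is sent with every subsequent request from the same spider.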