Scrapy with a fixed list of URLs

Posted: 2014-12-04 17:43:10

Tags: python web-scraping scrapy

Trying to wrap my head around this... I have a fixed list of 100,000 URLs that I want to scrape, which is fine, I know how to handle that. But first I need to get a cookie from an initial form POST and use it for the subsequent requests. Would that be something like a nested spider? Just trying to understand the architecture for this use case.

Thanks!

1 answer:

Answer 0 (score: 1)

Scrapy handles cookies automatically.
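Cookie handling is controlled by the COOKIES_ENABLED setting, which defaults to True, so no extra configuration is needed unless your project turned it off. A minimal sketch, purely for illustration:

# settings.py: the cookies middleware is on by default in Scrapy;
# this line only matters if it was previously disabled
COOKIES_ENABLED = True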

All you need to do is POST the login form first, then yield requests for the 100,000 URLs:

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = (
        'https://example.com/login',  # login page
    )

    def __init__(self, *args, **kwargs):
        self.url_list = []  # your list of 100,000 URLs
        return super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        # fill in your login form fields here (e.g. username/password)
        data = {}

        # POST the form; the session cookie set in the response is
        # stored by Scrapy's cookies middleware for later requests
        return scrapy.FormRequest.from_response(
            response,
            formdata=data,
            callback=self.my_start_requests
        )

    def my_start_requests(self, response):
        # ignore the login callback response
        for url in self.url_list:
            # scrapy will take care of the cookies
            yield scrapy.Request(url, callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        # your scraping code here
        pass
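For the 100,000 URLs themselves, one common approach is to load them from a plain-text file, one URL per line. A minimal sketch of an __init__ that does this; the filename urls.txt and the one-URL-per-line format are assumptions, not part of the original answer:

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # assumption: urls.txt sits in the working directory, one URL per line
        with open('urls.txt') as f:
            self.url_list = [line.strip() for line in f if line.strip()]

Keeping the list in memory is fine at this scale; Scrapy's scheduler queues the requests and fetches them subject to the project's concurrency settings.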