Python Scrapy: how do I use self.download_delay?

Date: 2017-04-06 16:33:02

Tags: python scrapy

I have never used Scrapy before. Please help!

I want to add a delay between each request to the "next_link".

Example:

GET https://example.com/?page=1

wait 30 seconds

GET https://example.com/?page=2

wait 30 seconds

import scrapy
from datetime import datetime

class CVSpider(scrapy.Spider):
    name = 'cvspider'
    start_urls = ["login"]
    custom_settings = {
        'DOWNLOAD_DELAY': 0,
        'RANDOMIZE_DOWNLOAD_DELAY': True
    }

    def __init__(self, search_url, name=None, **kwargs):
        super().__init__(name, **kwargs)
        self.search_url = search_url

    def parse(self, response):
        xsrf = response.css('input[name="_xsrf"] ::attr(value)')\
                       .extract_first()
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': USERNAME,
                'password': PASSWORD,
                '_xsrf': xsrf
            },
            callback=self.after_login
        )

    def after_login(self, response):
        self.logger.info('Parse %s', response.url)
        if "account/login" in response.url:
            self.logger.error("Login failed!")
            return

        return scrapy.Request(self.search_url, callback=self.parse_search_page)

    def parse_search_page(self, response):
        cv_hashes = response\
            .css('table.output tr[itemscope="itemscope"]::attr(data-hash)')\
            .extract()
        total = len(cv_hashes)
        start_time = datetime.now()
        next_link = response.css('a.Controls-Next::attr(href)')\
                            .extract_first()
        if total == 0:
            next_link = None
        if next_link is not None:
            self.download_delay = 30  # <-- this does not work
            yield scrapy.Request(
                "https://example.com" + next_link,
                callback=self.parse_search_page
            )

2 Answers:

Answer 0 (score: 0)

There is a setting option to achieve this. In your settings.py file, set DOWNLOAD_DELAY as follows:

DOWNLOAD_DELAY = 30  # time in seconds (Scrapy expects seconds, not milliseconds)

But remember to remove custom_settings from your spider, because a spider's custom_settings take precedence over the values in settings.py.
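Note also that the question's original settings enable RANDOMIZE_DOWNLOAD_DELAY, in which case Scrapy does not wait exactly DOWNLOAD_DELAY seconds: it waits a uniformly random time between 0.5x and 1.5x the configured delay. A minimal pure-Python sketch of that behavior (the function name here is illustrative, not part of Scrapy's API):

```python
import random

def effective_delay(download_delay: float, randomize: bool = True) -> float:
    """Approximate the pause Scrapy applies between two requests to the
    same slot: a uniform value in [0.5 * delay, 1.5 * delay] when
    RANDOMIZE_DOWNLOAD_DELAY is on, or the fixed delay otherwise."""
    if randomize:
        return random.uniform(0.5 * download_delay, 1.5 * download_delay)
    return download_delay

# With DOWNLOAD_DELAY = 30, the actual pause falls between 15 and 45 seconds.
print(effective_delay(30))
```

Randomizing the delay makes the crawl pattern look less mechanical to the target site, which is why it is on by default.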

If you want to do this with the spider's custom settings instead, modify the code as follows:

class CVSpider(scrapy.Spider):
    name = 'cvspider'
    start_urls = ["login"]
    custom_settings = {
        'DOWNLOAD_DELAY': 30,  # seconds
        'RANDOMIZE_DOWNLOAD_DELAY': False
    }

    def __init__(self, search_url, name=None, **kwargs):
    ...

You can refer to the documentation for more information.

Answer 1 (score: 0)

The idea is that you only need to set the download_delay attribute on your spider and Scrapy will do the rest; you don't need to actually "use" it anywhere.

So just set it like this:

class MySpider(Spider):
    ...
    download_delay = 30  # seconds
    ...

That's it.