Does scrapy ignore CONCURRENT_REQUESTS when DOWNLOAD_DELAY is set?

Time: 2016-05-26 12:47:39

Tags: python scrapy

Looking at the scrapy stats (Crawled X pages (at X pages/min)), it seems to me that as soon as something like:

DOWNLOAD_DELAY = 4.5

is set, requests become sequential, no matter what CONCURRENT_REQUESTS is set to.

As I understand it, shouldn't the delay be counted separately for each concurrent request, or am I misunderstanding the scrapy architecture? So in my example, shouldn't:

scrapy crawl us_al -a cid_range=000001..000020

run faster with 10 concurrent requests, instead of the roughly 1 minute 50 seconds (keeping RANDOMIZE_DOWNLOAD_DELAY in mind) that it takes for me? That timing is what fully sequential requests would give: 20 requests at an average delay of 4.5 s is about 90 s of waiting alone, before any download time. How do I change this behavior? Without DOWNLOAD_DELAY, querying the 20 items takes 4 seconds with CONCURRENT_REQUESTS = 5 and 10 seconds with CONCURRENT_REQUESTS = 1, which makes much more sense to me.

Here is what the spider looks like:

import random
import re
import scrapy

class UsAlSpider(scrapy.Spider):
    name = "us_al"
    allowed_domains = ["arc-sos.state.al.us"]
    start_urls = []
    custom_settings = {
        'CONCURRENT_REQUESTS': 10,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
        'DOWNLOAD_DELAY': 4.5
    }

    def __init__(self, cid_range=None, *args, **kwargs):
        """
        Range (in the form: 000001..000010)
        """
        super(UsAlSpider, self).__init__(*args, **kwargs)

        self.cid_range = cid_range

    def start_requests(self):
        if self.cid_range and not re.search(r'^\d+\.\.\d+$', self.cid_range):
            self.logger.error('Check input parameter cid_range={} needs to be in form cid_range=000001..000010'.format(self.cid_range))
            return
        # crawl according to input option
        id_range = self.cid_range.split('..')
        shuffled_ids = ["{0:06}".format(i) for i in range(
            int(id_range[0]), int(id_range[1]) + 1)]
        random.shuffle(shuffled_ids)
        for id_ in shuffled_ids:
            yield self.make_requests_from_url('http://arc-sos.state.al.us/cgi/corpdetail.mbr/detail?corp={}'.format(id_))

    def parse(self, response):
        # parse the page info
        pass
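
For reference, a quick way to check whether downloads actually overlap is to log arrival times in parse (a hypothetical check, not part of the spider above); with sequential downloads the timestamps come out spaced roughly DOWNLOAD_DELAY apart rather than clustered:

    def parse(self, response):
        import time  # for the timestamp check only
        # Arrival times spaced ~DOWNLOAD_DELAY apart mean sequential fetching;
        # clustered timestamps would mean requests really are overlapping.
        self.logger.info('%s fetched at %.2f', response.url, time.time())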

1 Answer:

Answer 0 (score: 2):

CONCURRENT_REQUESTS is just a way of capping the number of requests, so there is no problem with setting CONCURRENT_REQUESTS to a high number if you are relying on any of the other settings (which are usually enforced per domain).
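
As a rough illustration (the numbers here are invented for the example): the effective per-domain concurrency is the smallest of the applicable caps, so raising the global cap changes nothing when a per-domain setting is tighter:

custom_settings = {
    'CONCURRENT_REQUESTS': 100,           # global cap across all domains
    'CONCURRENT_REQUESTS_PER_DOMAIN': 8,  # per-domain cap
}
# Crawling a single domain, at most min(100, 8) = 8 requests are in
# flight at once, no matter how high the global cap is raised.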

DOWNLOAD_DELAY is applied per domain, and that is by design: the idea behind it is to avoid hammering a specific site. It also constrains CONCURRENT_REQUESTS_PER_DOMAIN; in effect, DOWNLOAD_DELAY > 0 implies CONCURRENT_REQUESTS_PER_DOMAIN = 1.
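
This matches the logic in Scrapy's downloader (scrapy/core/downloader/__init__.py); the sketch below is condensed from roughly how that code read around this time, so treat it as illustrative rather than an exact quote:

def _get_concurrency_delay(concurrency, spider, settings):
    # Spider attributes override the project-wide settings.
    delay = settings.getfloat('DOWNLOAD_DELAY')
    if hasattr(spider, 'download_delay'):
        delay = spider.download_delay
    if hasattr(spider, 'max_concurrent_requests'):
        concurrency = spider.max_concurrent_requests
    if delay > 0:
        concurrency = 1  # force concurrency=1 when a download delay is required
    return concurrency, delay

So to actually run 10 concurrent requests against this site you would need to drop the fixed DOWNLOAD_DELAY; if you still want to be polite to the server, AUTOTHROTTLE_ENABLED = True is an alternative that adapts the delay to observed latencies instead of forcing a fixed one.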