Looking at the scrapy stats (Crawled X pages (at X pages/min)), it seems to me that as soon as I set, for example:

DOWNLOAD_DELAY = 4.5

requests become sequential, no matter what CONCURRENT_REQUESTS is set to. From my understanding, shouldn't the delay be counted per concurrent request, or am I misunderstanding the scrapy architecture? So in my case, shouldn't

scrapy crawl us_al -a cid_range=000001..000020

run faster with 10 concurrent requests, instead of taking about 1 minute 50 seconds (keeping RANDOMIZE_DOWNLOAD_DELAY in mind)? Is this how it is supposed to work, and how can I change this behavior? Without DOWNLOAD_DELAY, querying the 20 items takes about 4 seconds with CONCURRENT_REQUESTS = 5 and about 10 seconds with CONCURRENT_REQUESTS = 1, which makes much more sense to me.

Here is what the spider looks like:
import random
import re

import scrapy


class UsAlSpider(scrapy.Spider):
    name = "us_al"
    allowed_domains = ["arc-sos.state.al.us"]
    start_urls = []
    custom_settings = {
        'CONCURRENT_REQUESTS': 10,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
        'DOWNLOAD_DELAY': 4.5
    }

    def __init__(self, cid_range=None, *args, **kwargs):
        """
        Range (in the form: 000001..000010)
        """
        super(UsAlSpider, self).__init__(*args, **kwargs)
        self.cid_range = cid_range

    def start_requests(self):
        # Reject a missing or malformed cid_range up front, so the spider
        # exits with a clear message instead of a traceback below.
        if not self.cid_range or not re.search(r'^\d+\.\.\d+$', self.cid_range):
            self.logger.error('Check input parameter cid_range={} needs to be in form cid_range=000001..000010'.format(self.cid_range))
            return
        # Crawl according to the input option.
        id_range = self.cid_range.split('..')
        shuffled_ids = ["{0:06}".format(i) for i in range(
            int(id_range[0]), int(id_range[1]) + 1)]
        random.shuffle(shuffled_ids)
        for id_ in shuffled_ids:
            # scrapy.Request replaces the deprecated make_requests_from_url;
            # the default callback is self.parse.
            yield scrapy.Request(
                'http://arc-sos.state.al.us/cgi/corpdetail.mbr/detail?corp={}'.format(id_))

    def parse(self, response):
        # parse the page info
        pass
Answer (score: 2):

CONCURRENT_REQUESTS is just a global cap on in-flight requests, so there is no problem with setting CONCURRENT_REQUESTS to a high number, as long as you use any of the other limits (which are usually enforced per domain).

DOWNLOAD_DELAY is applied per domain, and that is by design: the idea behind it is to avoid hammering a specific site. It also overrides CONCURRENT_REQUESTS_PER_DOMAIN, effectively behaving as DOWNLOAD_DELAY > 0 -> CONCURRENT_REQUESTS_PER_DOMAIN = 1.
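
That serialization also accounts for the timing in the question. Here is a rough sanity check (my own arithmetic, not part of the answer; the average response time is an assumption). With the default RANDOMIZE_DOWNLOAD_DELAY, each wait is drawn between 0.5x and 1.5x the configured delay, so the mean wait is still DOWNLOAD_DELAY:

# Back-of-the-envelope estimate of the crawl time, given that a non-zero
# DOWNLOAD_DELAY serializes requests to the domain. Not Scrapy code.
n_requests = 20
download_delay = 4.5        # seconds, from the spider's custom_settings
avg_response_time = 1.0     # assumed; depends on the site

estimate = n_requests * (download_delay + avg_response_time)
print(estimate)             # 110.0 s, i.e. roughly the observed ~1 min 50 s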
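
As for actually changing the behavior the question asks about (a suggestion of mine, not part of the answer above): either drop the fixed delay and keep politeness via a per-domain concurrency cap, or use Scrapy's AutoThrottle extension, which adjusts the delay dynamically toward a target concurrency. A minimal sketch of both options, as class attributes inside the spider:

# Option 1: no fixed delay; politeness comes from the concurrency cap.
custom_settings = {
    'DOWNLOAD_DELAY': 0,
    'CONCURRENT_REQUESTS': 10,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 5,
}

# Option 2: let AutoThrottle pick the delay for a target concurrency.
# Note that AutoThrottle never goes below DOWNLOAD_DELAY, so keep it at 0.
custom_settings = {
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 5.0,
    'DOWNLOAD_DELAY': 0,
}

Either way, keep the target rate polite for the site you are crawling.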