I have never used Scrapy. Please help!
I want to add a delay to every request made for "next_link". Example:
a delay of 30 seconds between requests.
import scrapy
from datetime import datetime

class CVSpider(scrapy.Spider):
    name = 'cvspider'
    start_urls = ["login"]
    custom_settings = {
        'DOWNLOAD_DELAY': 0,
        'RANDOMIZE_DOWNLOAD_DELAY': True
    }

    def __init__(self, search_url, name=None, **kwargs):
        super().__init__(name, **kwargs)
        self.search_url = search_url

    def parse(self, response):
        xsrf = response.css('input[name="_xsrf"]::attr(value)')\
            .extract_first()
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': USERNAME,
                'password': PASSWORD,
                '_xsrf': xsrf
            },
            callback=self.after_login
        )

    def after_login(self, response):
        self.logger.info('Parse %s', response.url)
        if "account/login" in response.url:
            self.logger.error("Login failed!")
            return
        return scrapy.Request(self.search_url, callback=self.parse_search_page)

    def parse_search_page(self, response):
        cv_hashes = response\
            .css('table.output tr[itemscope="itemscope"]::attr(data-hash)')\
            .extract()
        total = len(cv_hashes)
        start_time = datetime.now()
        next_link = response.css('a.Controls-Next::attr(href)')\
            .extract_first()
        if total == 0:
            next_link = None
        if next_link is not None:
            self.download_delay = 30  # does not work
            yield scrapy.Request(
                "https://example.com" + next_link,
                callback=self.parse_search_page
            )
Answer 0: (score: 0)
There is a settings option to achieve this. In the settings.py file, set DOWNLOAD_DELAY as follows:

DOWNLOAD_DELAY = 30  # delay in seconds between requests

But remember to remove custom_settings from your code, since it overrides the project settings.
If you would rather do this with the Spider's custom settings, modify the code as follows:
class CVSpider(scrapy.Spider):
    name = 'cvspider'
    start_urls = ["login"]
    custom_settings = {
        'DOWNLOAD_DELAY': 30,
        'RANDOMIZE_DOWNLOAD_DELAY': False
    }

    def __init__(self, search_url, name=None, **kwargs):
        ...
You can refer to the documentation for more information.
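As a side note, when RANDOMIZE_DOWNLOAD_DELAY is enabled Scrapy does not wait exactly DOWNLOAD_DELAY seconds: it waits a random amount between 0.5× and 1.5× of that value. A minimal pure-Python sketch of that behavior (the helper name is my own, not part of Scrapy's API):

```python
import random

def randomized_delay(download_delay: float) -> float:
    # Sketch of the randomized wait Scrapy applies when
    # RANDOMIZE_DOWNLOAD_DELAY is True: a uniform value in
    # [0.5 * delay, 1.5 * delay]. The function name is illustrative.
    return random.uniform(0.5 * download_delay, 1.5 * download_delay)

# With DOWNLOAD_DELAY = 30, the actual pause is between 15 and 45 seconds.
for _ in range(5):
    wait = randomized_delay(30)
    assert 15.0 <= wait <= 45.0
```

This is why a fixed, exact 30-second gap requires RANDOMIZE_DOWNLOAD_DELAY to be set to False, as in the spider above.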
Answer 1: (score: 0)
The idea is to simply set a download_delay variable on your spider; Scrapy does the rest, and you do not need to actually "use" it anywhere. So just set it:

class MySpider(Spider):
    ...
    download_delay = 30  # in seconds
    ...

That's it.