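I am using scrapy-splash, but I have a memory problem. I can clearly see the memory used by the docker python3 process growing steadily until the PC freezes.
I can't figure out why this happens: I have CONCURRENT_REQUESTS=3, and there is no way 3 HTML pages consume 10 GB of RAM.
There is a workaround: set maxrss to some reasonable value. When RAM usage reaches that value, docker restarts, which frees the RAM.
The problem is that while docker is down, Scrapy keeps sending requests, so a couple of URLs are not scraped. The retry middleware retries these requests immediately and then gives up.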
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.ex.com/eiB3t/ via http://127.0.0.1:8050/execute> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-03-30 14:28:33 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.ex.com/eiB3t/
So I have two questions:
How can I set Scrapy to retry requests after some delay (say a minute, so docker has time to restart)?

Answer 0 (score: 0)
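A more elaborate solution would be to set up a Kubernetes cluster running several replicas. That way, the failure of a single container does not disrupt your scraping job.
I don't think it is easy to configure a wait time for retries only. You could use DOWNLOAD_DELAY (but that affects the delay between all requests), or set RETRY_TIMES to a value higher than the default of 2.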
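
As a rough sketch, the two settings mentioned above could go into settings.py like this. The values are illustrative assumptions, not recommendations; tune them to how long your docker container takes to restart.

```python
# settings.py (sketch; the numbers below are assumed example values)

# Retry each failed request up to 5 times instead of the default 2.
RETRY_TIMES = 5

# Wait at least 2 seconds between consecutive requests.
# Note: this slows down every request, not only the retries.
DOWNLOAD_DELAY = 2
```

More retries combined with a delay give the container a longer window to come back before Scrapy gives up, at the cost of a slower crawl overall.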

Answer 1 (score: 0)
One way is to add a middleware to your spider:
# File: middlewares.py
from twisted.internet import reactor
from twisted.internet.defer import Deferred


class DelayedRequestsMiddleware(object):
    def process_request(self, request, spider):
        # Delay in seconds, read from the request's meta; missing or 0 means no delay
        delay_s = request.meta.get('delay_request_by', None)
        if not delay_s:
            return
        # Fire the deferred after delay_s seconds without blocking the reactor
        deferred = Deferred()
        reactor.callLater(delay_s, deferred.callback, None)
        return deferred
You can then use it in your spider like this:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {'middlewares.DelayedRequestsMiddleware': 123},
    }

    def start_requests(self):
        # This request will have itself delayed by 5 seconds
        yield scrapy.Request(url='http://quotes.toscrape.com/page/1/',
                             meta={'delay_request_by': 5})
        # This request will not be delayed
        yield scrapy.Request(url='http://quotes.toscrape.com/page/2/')

    def parse(self, response):
        ...  # Process results here

You can also do this with a custom retry middleware; you only need to override the process_response method of the existing RetryMiddleware:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class CustomRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # Your delay code here, for example sleep(10) or polling server until it is alive
            return self._retry(request, reason, spider) or response
        return response
Then enable it in settings.py in place of the default RetryMiddleware:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
}
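
For the "polling server until it is alive" option mentioned in the comment of the retry middleware, a minimal sketch could look like the following. The helper name wait_for_port and its parameters are hypothetical, not part of Scrapy or Splash. It only checks that something accepts TCP connections on the port, which is an assumption about what "alive" means; Splash may need a little longer before it can actually render pages.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Return True once a TCP connection to (host, port) succeeds,
    or False if timeout_s elapses first. Hypothetical helper."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError while nothing is listening
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            # Pause before the next attempt
            time.sleep(interval_s)
```

In the middleware it could be called as wait_for_port("127.0.0.1", 8050) before self._retry(...), using the Splash address from the log above. Keep in mind that this is a blocking call: while it waits, the Twisted reactor is stalled and no other request is processed. That may be acceptable here, since every request goes through the same Splash instance anyway, but it is a trade-off rather than a general recommendation.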