I am trying to run a scraper whose log output ends as follows:
2017-04-25 20:22:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 http://www.apkmirror.com/apk/instagram/instagram-instagram/instagram-instagram-9-0-0-34920-release/instagram-9-0-0-4-android-apk-download/>: HTTP status code is not handled or not allowed
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-25 20:22:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16048410,
'downloader/request_count': 32902,
'downloader/request_method_count/GET': 32902,
'downloader/response_bytes': 117633316,
'downloader/response_count': 32902,
'downloader/response_status_count/200': 121,
'downloader/response_status_count/429': 32781,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 4, 25, 18, 22, 22, 710446),
'log_count/DEBUG': 32903,
'log_count/INFO': 32815,
'request_depth_max': 2,
'response_received_count': 32902,
'scheduler/dequeued': 32902,
'scheduler/dequeued/memory': 32902,
'scheduler/enqueued': 32902,
'scheduler/enqueued/memory': 32902,
'start_time': datetime.datetime(2017, 4, 25, 17, 54, 36, 621481)}
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Spider closed (finished)
In short, out of 32,902 requests only 121 were successful (response code 200), while the rest received a '429 Too Many Requests' (see https://httpstatuses.com/429).
Is there a recommended way to deal with this? To start with, I would like to look at the details of the 429 response rather than ignoring it, since it may contain a Retry-After header indicating how long to wait before making a new request.
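As a rough sketch of what I mean (untested), Scrapy's handle_httpstatus_list spider attribute should let the 429 response reach the callback so the header can be inspected; the spider name and callback logic here are placeholders only:

import scrapy

class ApkMirrorSpider(scrapy.Spider):  # hypothetical spider, for illustration only
    name = 'apkmirror'
    handle_httpstatus_list = [429]  # pass 429 responses to the callback instead of dropping them

    def parse(self, response):
        if response.status == 429:
            # Retry-After is usually a number of seconds, but may also be an HTTP date
            retry_after = response.headers.get('Retry-After')
            self.logger.info('Rate limited, Retry-After=%r', retry_after)
            return
        # ... normal parsing of 200 responses ...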
Also, if the requests are made through Privoxy and Tor, as described in http://blog.michaelyin.info/2014/02/19/scrapy-socket-proxy/, it might be possible to implement retry middleware that makes Tor change its IP address whenever this happens. Are there public examples of such code?
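For reference, the kind of helper I imagine such a middleware calling would look roughly like this; it is only a sketch and assumes Tor's ControlPort is enabled on 9051 and that the stem library is installed:

# Sketch only: assumes Tor exposes its ControlPort on 9051 and `pip install stem`.
from stem import Signal
from stem.control import Controller

def renew_tor_identity(password=None):
    """Ask Tor for a new circuit, which usually yields a new exit IP."""
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)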
Answer 0 (score: 10)
You can modify the retry middleware so that it pauses when a 429 error is received. Put this code in middlewares.py:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

import time


class TooManyRequestsRetryMiddleware(RetryMiddleware):

    def __init__(self, crawler):
        super(TooManyRequestsRetryMiddleware, self).__init__(crawler.settings)
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        elif response.status == 429:
            # Pause the whole crawl while we wait out the rate limit, then retry the request.
            self.crawler.engine.pause()
            time.sleep(60)  # If the rate limit is renewed in a minute, put 60 seconds, and so on.
            self.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        elif response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response
Add 429 to the retry codes in settings.py:
RETRY_HTTP_CODES = [429]
Then activate it in settings.py. Don't forget to deactivate the default retry middleware:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'flat.middlewares.TooManyRequestsRetryMiddleware': 543,
}
Answer 1 (score: 5)
Request volumes this high will trip the rate limits of larger sites and will take smaller sites down completely. Don't do this.
This may well be too fast even for Privoxy and Tor, so they may also be candidates for those 429 replies.
Solution:
Start slow. Reduce the concurrency settings and increase DOWNLOAD_DELAY so that you make at most 1 request per second. Then increase those values step by step and see what happens. It may sound paradoxical, but you can end up getting more items and more 200 responses by going slower.
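As an illustration only (the exact values are placeholders to tune against your target site), a conservative settings.py might look like this:

# settings.py -- conservative throttling; adjust the numbers for your target site
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1               # roughly one request per second at most
RANDOMIZE_DOWNLOAD_DELAY = True

# Optionally let Scrapy adapt the delay to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 60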
If you are scraping a big site, try rotating proxies. Tor can be a bit clumsy for this in my experience, so you might try a proxy service like the one Umair suggested.
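In case it helps, proxy rotation in Scrapy can be as simple as setting request.meta['proxy'] from a downloader middleware; the sketch below uses placeholder proxy URLs and is not a complete solution:

import random

class RotatingProxyMiddleware:
    # Placeholder endpoints -- replace with your own proxies or a proxy service.
    PROXIES = [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours request.meta['proxy'].
        request.meta['proxy'] = random.choice(self.PROXIES)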
Answer 2 (score: 0)
You can use a … I was getting the 429 HTTP code, but I just allowed it and that solved the problem. You can allow the HTTP code that shows up in the terminal. That may solve your problem.
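If what is meant here is Scrapy's HTTPERROR_ALLOWED_CODES setting (an assumption, since the sentence above is cut off), the change is a single line in settings.py; 429 responses then reach your callback instead of being dropped by HttpErrorMiddleware:

# settings.py -- assumption: the answer means allowing the 429 status code
HTTPERROR_ALLOWED_CODES = [429]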
Answer 3 (score: 0)
Building on Aminah Nuraini's answer, you can use Twisted's Deferreds to avoid breaking asynchrony by calling time.sleep():
from twisted.internet import reactor, defer
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class TooManyRequestsRetryMiddleware(RetryMiddleware):
    DEFAULT_DELAY = 600  # 10 min

    async def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        elif response.status == 429:
            try:
                # Honour the server's Retry-After header when it is present
                delay = int(response.headers.get('retry-after'))
            except (TypeError, ValueError):
                delay = self.DEFAULT_DELAY
            # Fire a Deferred after `delay` seconds instead of blocking with time.sleep()
            deferred = defer.Deferred()
            reactor.callLater(delay, deferred.callback, None)
            spider.crawler.engine.pause()
            await deferred
            spider.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        elif response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response
The line await deferred blocks process_response from executing until delay seconds have passed, but Scrapy is free to do other things in the meantime.
Note the async in the method definition. async/await is the coroutine syntax introduced in Python 3.5 and supported by Scrapy since 2.0.
In this example I chose to honor the Retry-After header if it is available, but that is not always necessary.
You still need to modify settings.py as in the original answer.
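For completeness, a sketch of the corresponding settings.py changes; the module path 'yourproject.middlewares' is a placeholder for wherever you put the class:

# settings.py -- 'yourproject.middlewares' is a placeholder path
RETRY_HTTP_CODES = [429]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'yourproject.middlewares.TooManyRequestsRetryMiddleware': 543,
}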
Answer 4 (score: 0)
Here is a simple trick I found:
import scrapy
import time  ## just add this line

BASE_URL = 'your any url'


class EthSpider(scrapy.Spider):
    name = 'eth'
    start_urls = [
        BASE_URL.format(1)
    ]
    pageNum = 2

    def parse(self, response):
        data = response.json()
        for i in range(len(data['data']['list'])):
            yield data['data']['list'][i]
        next_page = 'next page url'
        time.sleep(0.2)  # and add this line
        if EthSpider.pageNum <= data['data']['page']:
            EthSpider.pageNum += 1
            yield response.follow(next_page, callback=self.parse)