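I am running several Scrapy spiders, and memory usage only ever climbs; it never drops back down. It looks like I have a memory leak. Running the telnet console command prefs() twice, about a minute apart, gives me this: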
>>> prefs()
Live References
HtmlResponse 59 oldest: 35s ago
MySpider 1 oldest: 238s ago
Request 45942 oldest: 235s ago
Selector 59 oldest: 34s ago
>>> prefs()
Live References
HtmlResponse 94 oldest: 35s ago
MySpider 1 oldest: 301s ago
Request 79139 oldest: 298s ago
Selector 93 oldest: 35s ago
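The number of live Request objects keeps growing over time and they never seem to be released. I am using scrapy-deltafetch (https://github.com/scrapy-plugins/scrapy-deltafetch). Does this plugin have a known memory leak?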
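To put a number on that growth, here is a rough sketch that parses the two prefs() snapshots above and computes the change per second for each class. It assumes the age of the single MySpider instance can serve as a clock between the two snapshots; the helper names parse_prefs and growth_per_second are made up for this illustration and are not part of Scrapy.

```python
import re

# The two prefs() snapshots from the telnet console, copied verbatim.
SNAPSHOT_1 = """\
HtmlResponse 59 oldest: 35s ago
MySpider 1 oldest: 238s ago
Request 45942 oldest: 235s ago
Selector 59 oldest: 34s ago
"""

SNAPSHOT_2 = """\
HtmlResponse 94 oldest: 35s ago
MySpider 1 oldest: 301s ago
Request 79139 oldest: 298s ago
Selector 93 oldest: 35s ago
"""

LINE_RE = re.compile(r"^(\w+)\s+(\d+)\s+oldest:\s+(\d+)s ago$")


def parse_prefs(text):
    """Return {class_name: (live_count, oldest_age_in_seconds)}."""
    refs = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name, count, age = match.groups()
            refs[name] = (int(count), int(age))
    return refs


def growth_per_second(before, after, clock="MySpider"):
    """Live-object growth per second for every class present in both snapshots.

    Assumption: the oldest `clock` object lives for the whole run, so the
    difference in its age is the time elapsed between the snapshots.
    """
    elapsed = after[clock][1] - before[clock][1]
    return {
        name: (after[name][0] - before[name][0]) / elapsed
        for name in after
        if name in before
    }


rates = growth_per_second(parse_prefs(SNAPSHOT_1), parse_prefs(SNAPSHOT_2))
for name, rate in sorted(rates.items()):
    print(f"{name}: {rate:+.1f} live objects per second")
```

With these two snapshots, 63 seconds apart, the Request count grows by roughly 527 objects per second, while HtmlResponse and Selector grow by less than one per second.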
One answer here, Memory Leak in Scrapy, suggests switching the scheduler from LIFO to FIFO with the following settings:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
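To make concrete what the LIFO-to-FIFO switch changes, here is a toy model of a pending-request queue. It is not Scrapy's scheduler: it assumes a synthetic link tree in which every page links to `branching` new pages down to `max_depth`, with no duplicate links, and the function name peak_pending is made up for this sketch.

```python
from collections import deque


def peak_pending(branching, max_depth, fifo):
    """Return the largest number of pending requests seen during the crawl.

    Each queue entry is just the depth of a page; "fetching" a page above
    max_depth enqueues `branching` child pages.
    """
    pending = deque([0])  # start with a single root page at depth 0
    peak = len(pending)
    while pending:
        # FIFO takes the oldest request (breadth-first),
        # LIFO takes the newest request (depth-first).
        depth = pending.popleft() if fifo else pending.pop()
        if depth < max_depth:
            pending.extend([depth + 1] * branching)
        peak = max(peak, len(pending))
    return peak


fifo_peak = peak_pending(branching=3, max_depth=4, fifo=True)
lifo_peak = peak_pending(branching=3, max_depth=4, fifo=False)
print("FIFO peak pending:", fifo_peak)
print("LIFO peak pending:", lifo_peak)
```

On this small tree the FIFO queue peaks at 81 pending entries and the LIFO queue at 9. A real crawl with cyclic links, duplicate filtering, and no depth limit will behave differently, so this only illustrates that the queue order changes how many requests are waiting at once, not whether they are ever released.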
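However, switching to FIFO does not seem to solve the problem. The answer also says I should use JOBDIR, but I assumed that scrapy-deltafetch would take care of all of that for me. Instead of JOBDIR, I set: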
MySpider.custom_settings['DELTAFETCH_DIR'] = 'crawler/name'
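For comparison, JOBDIR and DELTAFETCH_DIR are separate settings: JOBDIR is Scrapy's own setting for persisting the state of a crawl job (including the scheduler's pending requests) on disk, while DELTAFETCH_DIR is where scrapy-deltafetch keeps its record of already-scraped pages. A minimal sketch of a settings fragment with both, where the name CRAWL_SETTINGS and the JOBDIR path are placeholders I made up:

```python
# Hypothetical settings fragment; only the setting names are real.
CRAWL_SETTINGS = {
    # Scrapy core: directory holding the persistent state of one crawl job.
    # The path is a placeholder, not one I actually use.
    "JOBDIR": "crawls/myspider-1",
    # scrapy-deltafetch: enable the middleware and point it at its own
    # storage directory (same value as in the line above).
    "DELTAFETCH_ENABLED": True,
    "DELTAFETCH_DIR": "crawler/name",
}

for key, value in CRAWL_SETTINGS.items():
    print(f"{key} = {value!r}")
```

These could presumably be merged into custom_settings the same way as the DELTAFETCH_DIR line above; I have not verified whether that changes the memory behaviour.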
Here is my code:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    custom_settings = {
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        'DOWNLOAD_TIMEOUT': 60,
        'DEPTH_LIMIT': 0,  # 0 means no depth limit
        'LOG_LEVEL': 'INFO',
        'DELTAFETCH_ENABLED': True,
        'SPIDER_MIDDLEWARES': {
            'scrapy_deltafetch.DeltaFetch': 100,
        },
        'DOWNLOADER_MIDDLEWARES': {
            'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
            'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
        },
        'ROTATING_PROXY_BAN_POLICY': 'spiders.classes.proxies.policy.MyPolicy',
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 403],
        'ROTATING_PROXY_PAGE_RETRY_TIMES': 10,
        # FIFO scheduler settings from the linked answer, currently disabled
        # 'DEPTH_PRIORITY': 1,
        # 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        # 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
        'TELNETCONSOLE_USERNAME': 'scrapy',
        'TELNETCONSOLE_PASSWORD': '7bkYpew6'
    }
    name = None
    allowed_domains = ['allowed_domains']
    sitemap_urls = ['start_urls']

    def parse(self, response):
        # This is for speed testing
        # Follow every link on the page and parse it with this same callback
        le = LinkExtractor()
        for link in le.extract_links(response):
            yield response.follow(link.url, self.parse)