Memory leak in a Scrapy spider using scrapy-deltafetch

Date: 2019-03-11 16:55:16

Tags: python memory-leaks scrapy

I'm running several Scrapy spiders, and memory usage just keeps climbing and never comes back down. It looks like I have a memory leak. Running prefs() in the telnet console, I get this:

>>> prefs()
Live References

HtmlResponse                       59   oldest: 35s ago
MySpider                            1   oldest: 238s ago
Request                         45942   oldest: 235s ago
Selector                           59   oldest: 34s ago

>>> prefs()
Live References

HtmlResponse                       94   oldest: 35s ago
MySpider                            1   oldest: 301s ago
Request                         79139   oldest: 298s ago
Selector                           93   oldest: 35s ago

My Request objects seem to grow over time and are never released. I'm using scrapy-deltafetch (https://github.com/scrapy-plugins/scrapy-deltafetch); does this plugin have any known memory leaks?
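Beyond prefs(), the trackref helpers available in the same telnet console can show which URLs the oldest pending requests point at, which often reveals where the queue is growing. A sketch of such a session (this assumes the crawl is still running; get_oldest and iter_all are from Scrapy's memory-leak debugging utilities):

```
# Run inside the Scrapy telnet console
>>> from scrapy.utils.trackref import get_oldest, iter_all
>>> r = get_oldest('Request')   # oldest live Request object
>>> r.url                       # where did it come from?
>>> from collections import Counter
>>> from urllib.parse import urlparse
>>> Counter(urlparse(req.url).netloc for req in iter_all('Request')).most_common(5)
```

If the counts concentrate on one domain or URL pattern, that section of the site is likely generating requests faster than they are consumed.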

An answer here, Memory Leak in Scrapy, suggested switching the scheduler from LIFO to FIFO with the following settings:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

However, this doesn't seem to fix the problem. The answer also suggests using JOBDIR, but I would have thought that scrapy-deltafetch, which I'm already using, should handle all of that for me? Instead, I set:

MySpider.custom_settings['DELTAFETCH_DIR'] = 'crawler/name'
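For what it's worth, DeltaFetch and JOBDIR solve different problems: DELTAFETCH_DIR only stores fingerprints of requests whose responses produced items, so already-scraped pages are skipped on the next run; it does not move the pending request queue out of memory. The SCHEDULER_DISK_QUEUE setting only takes effect when JOBDIR is set. A hypothetical sketch of combining the two (the JOBDIR path is just a placeholder):

```python
# Hypothetical settings sketch: JOBDIR persists the pending request
# queue (and dupefilter) to disk, while DeltaFetch separately skips
# requests whose items were already scraped in a previous run.
custom_settings = {
    'DELTAFETCH_ENABLED': True,
    'JOBDIR': 'crawls/myspider-1',  # placeholder path
    'DEPTH_PRIORITY': 1,
    'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
    'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
}
```

With JOBDIR set, pending Request objects are serialized to the disk queue instead of accumulating in memory, which is the usual remedy for a large broad crawl.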

Here is my code:

class MySpider(SitemapSpider):
    custom_settings = {
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        'DOWNLOAD_TIMEOUT': 60,
        'DEPTH_LIMIT': 0,
        'LOG_LEVEL': 'INFO',
        'DELTAFETCH_ENABLED': True,
        'SPIDER_MIDDLEWARES': {
            'scrapy_deltafetch.DeltaFetch': 100,
        },
        'DOWNLOADER_MIDDLEWARES': {
            'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
            'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
        },
        'ROTATING_PROXY_BAN_POLICY': 'spiders.classes.proxies.policy.MyPolicy',
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 403],
        'ROTATING_PROXY_PAGE_RETRY_TIMES': 10,
        # 'DEPTH_PRIORITY': 1,
        # 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        # 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
        'TELNETCONSOLE_USERNAME': 'scrapy',
        'TELNETCONSOLE_PASSWORD': '7bkYpew6'
    }

    name = None
    allowed_domains = ['allowed_domains']
    sitemap_urls = ['start_urls']

    def parse(self, response):
        # This is for speed testing
        le = LinkExtractor()
        for link in le.extract_links(response):
            yield response.follow(link.url, self.parse)

0 Answers:

No answers