Scrapy: maximum file size error

Posted: 2017-08-01 16:21:29

Tags: python scrapy

I'm having trouble downloading a large (~1.8 GB) file with Scrapy. My code:

import scrapy

class CHSpider(scrapy.Spider):
    name = "ch_accountdata"
    allowed_domains = ['download.companieshouse.gov.uk']
    start_urls = ['http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html']

    custom_settings = {
        'DOWNLOAD_WARNSIZE': 0,
    }

    def parse(self, response):
        relative_url = response.xpath("//div[@class='grid_7 push_1 omega']/ul/li[12]/a/@href").extract()[0]
        download_url = response.urljoin(relative_url)
        yield {
            'file_urls': [download_url]
        }

This returns the following error:



2017-08-01 17:10:33 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: develop)
2017-08-01 17:10:33 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'develop.spiders', 'SPIDER_MODULES': ['develop.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'develop'}
2017-08-01 17:10:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-08-01 17:10:34 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-08-01 17:10:34 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-08-01 17:10:34 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2017-08-01 17:10:34 [scrapy.core.engine] INFO: Spider opened
2017-08-01 17:10:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 17:10:34 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-08-01 17:10:35 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://download.companieshouse.gov.uk/robots.txt> (referer: None)
2017-08-01 17:10:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html> (referer: None)
2017-08-01 17:10:35 [scrapy.core.downloader.handlers.http11] ERROR: Cancelling download of http://download.companieshouse.gov.uk/Accounts_Monthly_Data-June2017.zip: expected response size (1240658506) larger than download max size (1073741824).
2017-08-01 17:10:35 [scrapy.pipelines.files] WARNING: File (unknown-error): Error downloading file from <GET http://download.companieshouse.gov.uk/Accounts_Monthly_Data-June2017.zip> referred in <None>: Cancelling download of http://download.companieshouse.gov.uk/Accounts_Monthly_Data-June2017.zip: expected response size (1240658506) larger than download max size (1073741824).
2017-08-01 17:10:35 [scrapy.core.scraper] DEBUG: Scraped from <200 http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html>
{'files': [], 'file_urls': [u'http://download.companieshouse.gov.uk/Accounts_Monthly_Data-June2017.zip']}
2017-08-01 17:10:35 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-01 17:10:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.internet.defer.CancelledError': 1,
 'downloader/request_bytes': 755,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 11061,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 1, 16, 10, 35, 806000),
 'item_scraped_count': 1,
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 8, 1, 16, 10, 34, 559000)}
2017-08-01 17:10:35 [scrapy.core.engine] INFO: Spider closed (finished)

I then added the following to the custom settings:

'DOWNLOAD_MAXSIZE' : 0,
'DOWNLOAD_TIMEOUT': 600
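For clarity, the spider's custom_settings presumably looked like this after the change (a sketch that simply merges the new keys with the DOWNLOAD_WARNSIZE entry shown earlier):

custom_settings = {
    'DOWNLOAD_WARNSIZE': 0,     # 0 disables the "response is too large" warning
    'DOWNLOAD_MAXSIZE': 0,      # 0 disables Scrapy's download size limit entirely
    'DOWNLOAD_TIMEOUT': 600,    # allow up to 10 minutes per download
}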

With these settings the run produces a different error, and Scrapy does not seem to stop:

2017-08-01 16:41:47 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: develop)
2017-08-01 16:41:47 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'develop.spiders', 'SPIDER_MODULES': ['develop.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'develop'}
2017-08-01 16:41:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-08-01 16:41:48 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-08-01 16:41:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-08-01 16:41:48 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2017-08-01 16:41:48 [scrapy.core.engine] INFO: Spider opened
2017-08-01 16:41:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 16:41:48 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-08-01 16:41:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://download.companieshouse.gov.uk/robots.txt> (referer: None)
2017-08-01 16:41:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html> (referer: None)
2017-08-01 16:42:48 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 16:43:48 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 16:44:48 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Unhandled Error
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\endpoints.py", line 125, in dataReceived
    return self._wrappedProtocol.dataReceived(data)
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1595, in dataReceived
    self._giveUp(Failure())
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1585, in _giveUp
    self._disconnectParser(reason)
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1573, in _disconnectParser
    parser.connectionLost(reason)
--- <exception caught here> ---
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 558, in connectionLost
    self.response)))
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 964, in dispatcher
    return func(*args, **kwargs)
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1220, in _bodyDataFinished_CONNECTED
    self._bodyProtocol.connectionLost(reason)
  File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 434, in connectionLost
    body = self._bodybuf.getvalue()
exceptions.MemoryError:

2017-08-01 16:45:29 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\endpoints.py", line 125, in dataReceived
    return self._wrappedProtocol.dataReceived(data)
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1595, in dataReceived
    self._giveUp(Failure())
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1585, in _giveUp
    self._disconnectParser(reason)
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1573, in _disconnectParser
    parser.connectionLost(reason)
--- <exception caught here> ---
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 558, in connectionLost
    self.response)))
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 964, in dispatcher
    return func(*args, **kwargs)
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1220, in _bodyDataFinished_CONNECTED
    self._bodyProtocol.connectionLost(reason)
  File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 434, in connectionLost
    body = self._bodybuf.getvalue()
exceptions.MemoryError:

2017-08-01 16:45:48 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 16:46:48 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 16:47:48 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 16:48:48 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 16:49:48 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 16:50:48 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 16:51:48 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Unhandled Error
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\commands\crawl.py", line 58, in run
    self.crawler_process.start()
  File "c:\python27\lib\site-packages\scrapy\crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1243, in run
    self.mainLoop()
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1252, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 878, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 547, in cancel
    self.result.cancel()
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 536, in cancel
    canceller(self)
  File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 352, in _cancel
    txresponse._transport._producer.abortConnection()
exceptions.AttributeError: 'NoneType' object has no attribute 'abortConnection'

2017-08-01 16:51:50 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\commands\crawl.py", line 58, in run
    self.crawler_process.start()
  File "c:\python27\lib\site-packages\scrapy\crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1243, in run
    self.mainLoop()
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1252, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 878, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 547, in cancel
    self.result.cancel()
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 536, in cancel
    canceller(self)
  File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 352, in _cancel
    txresponse._transport._producer.abortConnection()
exceptions.AttributeError: 'NoneType' object has no attribute 'abortConnection'

2017-08-01 16:52:48 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 16:53:48 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 16:54:48 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), 

Edit: full settings.py file:

# -*- coding: utf-8 -*-

# Scrapy settings for develop project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'develop'

SPIDER_MODULES = ['develop.spiders']
NEWSPIDER_MODULE = 'develop.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'develop (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'develop.middlewares.DevelopSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'develop.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'develop.pipelines.DevelopPipeline': 300,
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/Users/MichaelAnderson/GDrive/Python/develop/data'
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

I have not added anything to pipelines.py.

items.py looks like:

import scrapy
from scrapy.item import Item, Field

class FiledownloadItem(Item):
    file_urls = Field()
    files = Field()

I then reordered and updated my custom settings:

custom_settings = {
    'DOWNLOAD_TIMEOUT': 60000,
    'DOWNLOAD_MAXSIZE': 12406585060,
    'DOWNLOAD_WARNSIZE': 0
}

This produces a connection-lost error:

2017-08-04 16:29:05 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: develop)
2017-08-04 16:29:05 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'develop.spiders', 'SPIDER_MODULES': ['develop.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'develop'}
2017-08-04 16:29:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-08-04 16:29:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-08-04 16:29:05 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-08-04 16:29:05 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2017-08-04 16:29:05 [scrapy.core.engine] INFO: Spider opened
2017-08-04 16:29:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-04 16:29:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-04 16:29:07 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://download.companieshouse.gov.uk/robots.txt> (referer: None)
2017-08-04 16:29:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html> (referer: None)
2017-08-04 16:30:05 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2017-08-04 16:31:05 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Unhandled Error
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\endpoints.py", line 125, in dataReceived
    return self._wrappedProtocol.dataReceived(data)
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1595, in dataReceived
    self._giveUp(Failure())
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1585, in _giveUp
    self._disconnectParser(reason)
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1573, in _disconnectParser
    parser.connectionLost(reason)
--- <exception caught here> ---
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 558, in connectionLost
    self.response)))
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 964, in dispatcher
    return func(*args, **kwargs)
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1220, in _bodyDataFinished_CONNECTED
    self._bodyProtocol.connectionLost(reason)
  File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 434, in connectionLost
    body = self._bodybuf.getvalue()
exceptions.MemoryError:

2017-08-04 16:31:27 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\endpoints.py", line 125, in dataReceived
    return self._wrappedProtocol.dataReceived(data)
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1595, in dataReceived
    self._giveUp(Failure())
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1585, in _giveUp
    self._disconnectParser(reason)
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1573, in _disconnectParser
    parser.connectionLost(reason)
--- <exception caught here> ---
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 558, in connectionLost
    self.response)))
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 964, in dispatcher
    return func(*args, **kwargs)
  File "c:\python27\lib\site-packages\twisted\web\_newclient.py", line 1220, in _bodyDataFinished_CONNECTED
    self._bodyProtocol.connectionLost(reason)
  File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 434, in connectionLost
    body = self._bodybuf.getvalue()
exceptions.MemoryError:

2017-08-04 16:32:05 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-04 16:33:05 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

Am I asking Scrapy to do something it can't handle? Any help is much appreciated.

2 Answers:

Answer 0 (score: 0)

I'm not 100% comfortable writing this answer yet, because I've noticed differences between the error logs, not to mention between the first error log and the spider script you provided...

Given that I may not have the complete picture, since you've only provided the spider, you should also post your pipelines and your entire settings file. For now I'll work from the stack traces, which should be enough to give you a usable answer.

Regarding the difference...

yield {
    'file_urls': [download_url]
}

# From the first error log:

{'files': [], 'file_urls': [u'http://download.companieshouse.gov.uk/Accounts_Monthly_Data-June2017.zip']}

I'm assuming you may not have dug deeply into Scrapy's official documentation. When it comes to downloading anything with Scrapy, a few things have to be in place:

  1. In your items.py file, whether you are downloading images or files (I honestly don't know why they distinguish the two... to me it's all files, but hey, read the docs lol), your item must provide the following keys: 'files' or 'images', and 'file_urls' or 'image_urls'. A hint: when you eventually assign, in your spider, the URL(s) of whatever you want to download, the field you fill must be exactly one of these two.

  2. In your settings.py file, enabling the item pipeline (whether it be the images pipeline or the files pipeline) is another necessary step, along with the directory path where files are stored, plus the other relevant settings you mentioned such as the maximum download size and download timeout, as required for the files you are downloading (a combined sketch of items 1 and 2 follows this list).

  3. Finally, one of the requirements is that the item pipeline is configured correctly depending on whether you are handling files or images.

The fact that the stack trace still reports a size discrepancy on the download is what bothers me, but please check the official documentation and/or update your question to include everything... Although I can see why you'd think the two settings you provided are the right ones, since the error log says the file exceeded the maximum limit, I'm willing to bet the project isn't configured correctly.

https://doc.scrapy.org/en/latest/topics/media-pipeline.html
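For reference, here is a minimal sketch of the pieces the list above describes, following the media-pipeline documentation linked above. The class name, the FILES_STORE path, and the size/timeout values are illustrative placeholders, not taken from the question:

# items.py: the two fields FilesPipeline expects (a sketch, not the asker's exact file)
import scrapy

class FileDownloadItem(scrapy.Item):
    file_urls = scrapy.Field()   # URLs to download; filled in by the spider
    files = scrapy.Field()       # populated by FilesPipeline with the download results

# settings.py: the FilesPipeline-related settings (values are illustrative)
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/path/to/downloaded/files'   # placeholder storage directory
DOWNLOAD_MAXSIZE = 2 * 1024 ** 3            # allow responses up to 2 GiB
DOWNLOAD_TIMEOUT = 600                      # give a large download time to finish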

ANSWER UPDATE

OHHHH! I just noticed... 1) You set the download max to unlimited... "0":

    'DOWNLOAD_MAXSIZE' : 0,
    'DOWNLOAD_TIMEOUT': 600
    

2) The error log shows it cancelled the crawl/download because the expected "response" was larger than the max download size... which means the size limit is sitting at its default MAX...

So... why is the setting you explicitly set to unlimited being ignored?

*DURP!*

It's the syntax! ...wrong casing, lol. According to the official documentation... they should be all lowercase...

https://doc.scrapy.org/en/latest/topics/settings.html

That's all I've got for now... I'd still like you to drop in your full settings file, but we can go back and forth lol... I enjoy troubleshooting.

Answer 1 (score: 0)

A memory error could be the cause of the error you are seeing. You can type the following command in a console to check whether you are running 32-bit or 64-bit Python:

python -c 'import sys;print("64bit" if sys.maxsize > 2**32 else "32bit")'

If you are using the 32-bit version, switching to 64-bit Python may solve your problem.
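One note: the logs in the question show Windows paths (c:\python27), and cmd.exe does not treat single quotes as string delimiters, so the one-liner above may need double quotes there. A small script that reports the same information, offered only as a sketch (the file name check_bitness.py is illustrative):

# check_bitness.py: report whether the interpreter is a 32-bit or 64-bit build
import struct
import sys

# struct.calcsize("P") is the size of a C pointer in bytes: 4 on 32-bit builds, 8 on 64-bit builds
bits = struct.calcsize("P") * 8
print("Python %s, %d-bit build" % (sys.version.split()[0], bits))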