I am trying to scrape a JavaScript-enabled website using the scrapy-splash plugin.
I am on Ubuntu 16.04 with Docker installed, and I pulled and started Splash with these commands:
$ sudo docker pull scrapinghub/splash
$ sudo docker run -p 8050:8050 scrapinghub/splash
The Splash Docker container is running and everything seems fine, but
Splash prints the following while handling requests from Scrapy:
2017-07-20 03:03:23+0000 [-] Log opened.
2017-07-20 03:03:23.870491 [-] Splash version: 3.0
2017-07-20 03:03:24.007457 [-] Qt 5.9.1, PyQt 5.9, WebKit 602.1, sip 4.19.3, Twisted 16.1.1, Lua 5.2
2017-07-20 03:03:24.007614 [-] Python 3.5.2 (default, Nov 17 2016, 17:05:23) [GCC 5.4.0 20160609]
2017-07-20 03:03:24.007746 [-] Open files limit: 65536
2017-07-20 03:03:24.007879 [-] Can't bump open files limit
2017-07-20 03:03:24.291391 [-] Xvfb is started: ['Xvfb', ':911054901', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
2017-07-20 03:03:43.425858 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2017-07-20 03:04:09.534239 [-] verbosity=1
2017-07-20 03:04:09.534387 [-] slots=50
2017-07-20 03:04:09.534499 [-] argument_cache_max_entries=500
2017-07-20 03:04:09.534974 [-] Web UI: enabled, Lua: enabled (sandbox: enabled)
2017-07-20 03:04:09.535774 [-] Site starting on 8050
2017-07-20 03:04:09.535904 [-] Starting factory <twisted.web.server.Site object at 0x7f0e78e18d30>
libpng warning: iCCP: known incorrect sRGB profile
libpng warning: iCCP: known incorrect sRGB profile
process 1: D-Bus library appears to be incorrectly set up; failed to read machine uuid: UUID file '/etc/machine-id' should contain a hex string of length 32, not length 0, with no other text
See the manual page for dbus-uuidgen to correct this issue.
qt.network.ssl: QSslSocket: cannot resolve SSLv2_client_method
qt.network.ssl: QSslSocket: cannot resolve SSLv2_server_method
I think this part may be the problem:
qt.network.ssl: QSslSocket: cannot resolve SSLv2_client_method
qt.network.ssl: QSslSocket: cannot resolve SSLv2_server_method
since the website is served over HTTPS.
I imported scrapy-splash in my Scrapy spider like this:
from scrapy_splash import SplashRequest
and I am making requests like this:
yield SplashRequest(link, meta={'item': item}, callback=self.parse_data)
instead of
yield scrapy.Request(link, meta={'item': item}, callback=self.parse_data)
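But Splash is still not processing the requests: each one is retried after a 504 Gateway Time-out, as the crawl output below shows.
What am I doing wrong here? Is there a problem with Ubuntu?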
$ scrapy crawl sofaspider -o out.csv
2017-07-20 13:03:40 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: sofa)
2017-07-20 13:03:40 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'sofa.spiders', 'FEED_URI': 'out.csv', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_MODULES': ['sofa.spiders'], 'BOT_NAME': 'sofa', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36', 'FEED_FORMAT': 'csv'}
2017-07-20 13:03:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-07-20 13:03:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-20 13:03:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-07-20 13:03:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-07-20 13:03:40 [scrapy.core.engine] INFO: Spider opened
2017-07-20 13:03:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-20 13:03:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-07-20 13:03:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.raymourflanigan.com/Sofas.aspx> (referer: None)
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/willoughby-sofa-200326456.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/union-square-sofa-200223105.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/castin-microfiber-sofa-200278403.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/toby-microfiber-leather-look-reclining-sofa-200217215.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/bryant-II-leather-power-reclining-sofa-217282538.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/crosby-sofa-with-chaise-200235097.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/anastasia-sofa-200209167.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/stylus-power-reclining-sofa-202239352.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:40 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/cordelia-sofa-200211201.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/ellington-leather-power-reclining-sofa-202291427.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/delano-power-reclining-sofa-200212520.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/quincey-power-reclining-sofa-200215627.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/corliss-sofa-200331104.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/skye-microfiber-power-reclining-sofa-200320074.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/mckinley-sofa-200211302.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/diana-sofa-200345115.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
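
Every retry in the log above goes through http://localhost:8050/render.html, so one way to narrow this down might be to call that endpoint directly, without Scrapy in the loop. The following is only a minimal sketch: the helper name `splash_render_url` and the `wait` and `timeout` values are illustrative assumptions, not something taken from my spider. `url`, `wait` and `timeout` are arguments of Splash's render.html endpoint; a timeout above Splash's configured maximum would presumably require starting the container with a larger `--max-timeout`.

```python
from urllib.parse import urlencode

# Assumption: Splash listens where the crawl log says it does.
SPLASH_URL = "http://localhost:8050"


def splash_render_url(target, wait=0.5, timeout=60, splash_url=SPLASH_URL):
    """Build a render.html URL that asks Splash to render `target` directly.

    `wait` and `timeout` are in seconds; the defaults here are illustrative.
    """
    # urlencode percent-encodes the target URL so it survives as a query value.
    query = urlencode({"url": target, "wait": wait, "timeout": timeout})
    return f"{splash_url}/render.html?{query}"


if __name__ == "__main__":
    # Open the printed URL in a browser or with curl to see whether Splash
    # itself can render the page or also times out.
    print(splash_render_url("https://www.raymourflanigan.com/Sofas.aspx"))
```

If the printed URL also ends in a timeout when opened directly, the problem is more likely on the Splash side (or the site itself) than in the spider code.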