I installed splash and scrapy-splash in a Python virtual environment (Ubuntu 16.04) following the instructions in the README (middleware setup, etc.). Even though I don't get any errors in the log files (apparently), the HTML returned by scrapy-splash does not contain the HTML rendered by Splash, only the HTML downloaded by Scrapy itself (without Splash).
In some cases I can get the correct HTML. These are:
However, scrapy-splash does not return the correct HTML when using SplashRequest:
yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 0.5})
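The same render.html endpoint can also be exercised outside Scrapy through Splash's GET interface (e.g. in a browser), which helps check whether Splash itself renders the page. A minimal sketch using only the standard library to build that URL; the url and wait values are taken from the SplashRequest above, and SPLASH_URL matches the settings below:

```python
from urllib.parse import urlencode

SPLASH_URL = "http://127.0.0.1:8050"

# Same arguments passed to SplashRequest above.
params = {"url": "https://www.tampabay.com/events/", "wait": 0.5}

# render.html accepts these as query parameters on a plain GET.
render_url = SPLASH_URL + "/render.html?" + urlencode(params)
print(render_url)
```

Opening this URL in a browser should show the Splash-rendered HTML if the Splash container is working.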
This is my configuration in the settings.py file:
SPIDER_MIDDLEWARES = {
'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPLASH_URL = 'http://127.0.0.1:8050/'
SPLASH_COOKIES_DEBUG = True
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
I expect the output to be the HTML rendered by Splash, but it returns only the raw HTML without rendering.
Splash Docker startup messages:
process 1: D-Bus library appears to be incorrectly set up; failed to read machine uuid: UUID file '/etc/machine-id' should contain a hex string of length 32, not length 0, with no other text
See the manual page for dbus-uuidgen to correct this issue.
qt.network.ssl: QSslSocket: cannot resolve SSLv2_client_method
qt.network.ssl: QSslSocket: cannot resolve SSLv2_server_method
2019-04-17 14:35:28.198194 [events] {"timestamp": 1555511728, "status_code": 200, "user-agent": "Scrapy/1.3.3 (+http://scrapy.org)", "client_ip": "172.17.0.1", "load": [0.15, 0.38, 0.35], "rendertime": 5.785578966140747, "active": 0, "fds": 68, "qsize": 0, "method": "POST", "_id": 140284272664528, "path": "/render.html", "args": {"headers": {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "User-Agent": "Scrapy/1.3.3 (+http://scrapy.org)", "Accept-Language": "en", "Cookie": "__cfduid=d035cc38f38ee9f555aec777db4b1b8f81555511718"}, "uid": 140284272664528, "wait": 0.5, "url": "https://www.tampabay.com/events/"}, "maxrss": 159672}
2019-04-17 14:35:28.198893 [-] "172.17.0.1" - - [17/Apr/2019:14:35:27 +0000] "POST /render.html HTTP/1.1" 200 34075 "-" "Scrapy/1.3.3 (+http://scrapy.org)
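For reference, the POST that scrapy-splash issues to Splash (visible in the event-log entry above) can be sketched with the standard library alone; the url and wait values are copied from that log line, SPLASH_URL matches the settings, and the request is only constructed here, not sent:

```python
import json
from urllib.request import Request

SPLASH_URL = "http://127.0.0.1:8050"

# Arguments mirroring the "args" object in the Splash event-log entry.
body = {"url": "https://www.tampabay.com/events/", "wait": 0.5}

req = Request(
    SPLASH_URL + "/render.html",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
```

The 200 status and non-trivial rendertime in the log suggest Splash did receive and process this request.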
Scrapy log messages:
2019-04-17 16:35:18 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: tampabay)
2019-04-17 16:35:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tampabay.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_MODULES': ['tampabay.spiders'], 'BOT_NAME': 'tampabay', 'LOG_FILE': 'tampabay.log', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'DOWNLOAD_DELAY': 3}
2019-04-17 16:35:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2019-04-17 16:35:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-17 16:35:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-17 16:35:18 [scrapy.middleware] INFO: Enabled item pipelines:
['tampabay.pipelines.TampabayPipeline']
2019-04-17 16:35:18 [scrapy.core.engine] INFO: Spider opened
2019-04-17 16:35:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-17 16:35:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-17 16:35:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tampabay.com/robots.txt> (referer: None)
2019-04-17 16:35:18 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://127.0.0.1:8050/robots.txt> (referer: None)
2019-04-17 16:35:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tampabay.com/events/ via http://127.0.0.1:8050/render.html> (referer: None)
2019-04-17 16:35:28 [tampabay] DEBUG: ############## INSIDE FUNCTION -> parse ###############
2019-04-17 16:35:28 [tampabay] DEBUG: EVENTS: 0
2019-04-17 16:35:28 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-17 16:35:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1037,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 35911,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 17, 14, 35, 28, 333825),
'log_count/DEBUG': 6,
'log_count/INFO': 7,
'response_received_count': 3,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/render.html/request_count': 1,
'splash/render.html/response_count/200': 1,
'start_time': datetime.datetime(2019, 4, 17, 14, 35, 18, 83737)}
2019-04-17 16:35:28 [scrapy.core.engine] INFO: Spider closed (finished)