I'm having trouble scraping a JavaScript-driven website. I'm using scrapy-splash with Docker to render the JavaScript into HTML so that I can scrape it.
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = 'spd'
    start_urls = ['http://example.com']

    def start_requests(self):
        # Send each start URL through Splash so the page is rendered first.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 0.5})

    def parse(self, response):
        # 'xpath' is a placeholder for the real XPath expressions.
        for href in response.xpath('xpath'):
            yield {'info': href.xpath('xpath')}
Here is my terminal output:
2017-05-30 13:20:51 [scrapy.core.engine] INFO: Spider opened
2017-05-30 13:20:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-30 13:20:51 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-30 13:20:51 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://example.com via http://192.168.99.100:8050/render.html> (failed 1 times): Connection was refused by other side: 61: Connection refused.
2017-05-30 13:20:51 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://example.com via http://192.168.99.100:8050/render.html> (failed 2 times): Connection was refused by other side: 61: Connection refused.
2017-05-30 13:20:51 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://example.com via http://192.168.99.100:8050/render.html> (failed 3 times): Connection was refused by other side: 61: Connection refused.
2017-05-30 13:20:51 [scrapy.core.scraper] ERROR: Error downloading <GET http://example.com via http://192.168.99.100:8050/render.html>: Connection was refused by other side: 61: Connection refused.
2017-05-30 13:20:51 [scrapy.core.engine] INFO: Closing spider (finished)
Answer 0 (score: 0)
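The following log messages indicate that the Splash Docker container is either not running or not listening on the expected port (192.168.99.100:8050 here).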
DEBUG: Retrying <GET http://example.com via http://192.168.99.100:8050/render.html> (failed 1 times): Connection was refused by other side: 61: Connection refused.
DEBUG: Retrying <GET http://example.com via http://192.168.99.100:8050/render.html> (failed 2 times): Connection was refused by other side: 61: Connection refused.
DEBUG: Gave up retrying <GET http://example.com via http://192.168.99.100:8050/render.html> (failed 3 times): Connection was refused by other side: 61: Connection refused.
To check the status of your Docker containers (including ones that have exited), try running:
sudo docker ps -a | grep scrapinghub/splash
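If the container is listed as running, it can also help to confirm that something is actually accepting TCP connections at the address Scrapy is using. The following is a minimal sketch using only the Python standard library; `is_port_open` is a hypothetical helper name, and the host and port are simply the ones that appear in the log above, so adjust them to your own setup.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    # Host and port taken from the log output above (assumed to match SPLASH_URL).
    host, port = "192.168.99.100", 8050
    if is_port_open(host, port):
        print("Something is accepting connections on %s:%d" % (host, port))
    else:
        print("Nothing is accepting connections on %s:%d" % (host, port))
```

If this reports that nothing is accepting connections while the container shows as up, the likely causes are that the port was not published when the container was started or that the Docker host address differs from the one configured in Scrapy.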