In my settings.py:
SPLASH_URL = 'http://127.0.0.1:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
My spider source code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider
from scrapy_splash import SplashRequest


class SampleSpider(CrawlSpider):
    name = 'sample'
    allowed_domains = ['sample.com']

    def start_requests(self):
        urls = [
            'https://www.sample.com/view-all-clothing/bottoms/leggings'
        ]
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse)

    def parse(self, response):
        for item in response.css("li.product-compact"):
            yield {
                'category_link': response.request.url,
                'title': item.css("a.pdp-link::text").extract()
            }
Docker container:
MINGW64 /c/Program Files/Docker Toolbox
$ docker container ls
CONTAINER ID   IMAGE                COMMAND                  CREATED          STATUS          PORTS                                NAMES
75b69d937e79   scrapinghub/splash   "python3 /app/bin/sp…"   16 minutes ago   Up 16 minutes   5023/tcp, 127.0.0.1:8050->8050/tcp   vigilant_chatterjee
I still get this error:
2018-07-10 15:18:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://127.0.0.1:8050/robots.txt> (failed 1 times): Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:36 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://127.0.0.1:8050/robots.txt> (failed 2 times): Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://127.0.0.1:8050/robots.txt> (failed 3 times): Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:37 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://127.0.0.1:8050/robots.txt>: Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
ConnectionRefusedError: Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:38 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sample.com/view-all-clothing/bottoms/leggings via http://127.0.0.1:8050/render.html> (failed 1 times): Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sample.com/view-all-clothing/bottoms/leggings via http://127.0.0.1:8050/render.html> (failed 2 times): Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:40 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.sample.com/view-all-clothing/bottoms/leggings via http://127.0.0.1:8050/render.html> (failed 3 times): Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:40 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.sample.com/view-all-clothing/bottoms/leggings via http://127.0.0.1:8050/render.html>: Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:40 [scrapy.core.engine] INFO: Closing spider (finished)
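I have applied every setting I know of, but I cannot work out what I am doing wrong.
Please let me know, as I am still new to Python, Scrapy, and the Splash JS rendering service.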
Answer 0 (score: 0)
In settings.py you should set:
SPLASH_URL = 'http://0.0.0.0:8050'
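The Docker container should also publish the port on all of the server's network interfaces, so that the PORTS column shows: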
0.0.0.0:8050->8050/tcp
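
Before changing SPLASH_URL, it may help to confirm which address actually accepts connections from the machine running Scrapy. The following is only a minimal sketch using the standard library; the helper name splash_reachable is hypothetical, and the second candidate address is an assumption (a commonly seen Docker Toolbox VM address, which `docker-machine ip` should confirm on your setup).

```python
import socket
from urllib.parse import urlparse


def splash_reachable(splash_url, timeout=2.0):
    """Return True if a TCP connection to the host and port in splash_url succeeds."""
    parsed = urlparse(splash_url)
    host = parsed.hostname
    port = parsed.port or 80  # fall back to the default HTTP port
    try:
        # create_connection raises OSError (refused, timed out, unresolvable host)
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    candidates = [
        "http://127.0.0.1:8050",       # the value from the question's settings.py
        "http://192.168.99.100:8050",  # assumption: typical Docker Toolbox VM address
    ]
    for url in candidates:
        status = "reachable" if splash_reachable(url) else "refused or unreachable"
        print(url, status)
```

Whichever candidate reports as reachable is a reasonable value to try for SPLASH_URL. This checks only that the TCP port is open, not that Splash itself is healthy.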