Scrapy and Splash are set up correctly, but I still get a connection error

Date: 2018-07-10 07:31:01

Tags: python scrapy splash-screen scrapy-splash scrapinghub

In my settings.py:

SPLASH_URL = 'http://127.0.0.1:8050'
DOWNLOADER_MIDDLEWARES = {
  'scrapy_splash.SplashCookiesMiddleware': 723,
  'scrapy_splash.SplashMiddleware': 725,
  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
  'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

My spider source code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider
from scrapy_splash import SplashRequest

class SampleSpider(CrawlSpider):
  name = 'sample'
  allowed_domains = ['sample.com']

  def start_requests(self):
    urls = [
      'https://www.sample.com/view-all-clothing/bottoms/leggings'
    ]

    for url in urls:
      yield SplashRequest(url=url, callback=self.parse)

  def parse(self, response):
    for item in response.css("li.product-compact"):
      yield {
        'category_link': response.request.url,
        'title': item.css("a.pdp-link::text").extract()
      }

  pass

The Docker container:

MINGW64 /c/Program Files/Docker Toolbox
$ docker container ls
CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS              PORTS                                NAMES
75b69d937e79        scrapinghub/splash   "python3 /app/bin/sp…"   16 minutes ago      Up 16 minutes       5023/tcp, 127.0.0.1:8050->8050/tcp   vigilant_chatterjee

I still get this error:

2018-07-10 15:18:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://127.0.0.1:8050/robots.txt> (failed 1 times): Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:36 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://127.0.0.1:8050/robots.txt> (failed 2 times): Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://127.0.0.1:8050/robots.txt> (failed 3 times): Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:37 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://127.0.0.1:8050/robots.txt>: Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
ConnectionRefusedError: Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:38 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sample.com/view-all-clothing/bottoms/leggings via http://127.0.0.1:8050/render.html> (failed 1 times): Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sample.com/view-all-clothing/bottoms/leggings via http://127.0.0.1:8050/render.html> (failed 2 times): Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:40 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.sample.com/view-all-clothing/bottoms/leggings via http://127.0.0.1:8050/render.html> (failed 3 times): Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:40 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.sample.com/view-all-clothing/bottoms/leggings via http://127.0.0.1:8050/render.html>: Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
2018-07-10 15:18:40 [scrapy.core.engine] INFO: Closing spider (finished)

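I have applied every setting I know of, but I can't figure out what I'm doing wrong.

Please let me know, as I'm still new to Python, Scrapy, and the Splash JS rendering service.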
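
The refusal in the log above already hits the very first request to http://127.0.0.1:8050, before any page is rendered, which suggests that nothing is accepting TCP connections at that address from where Scrapy runs. Below is a minimal sketch of a check that is independent of Scrapy and uses only the standard library. The helper name splash_reachable is made up for illustration and is not part of scrapy-splash.

```python
import socket
from urllib.parse import urlparse


def splash_reachable(splash_url, timeout=3.0):
    """Return True if a plain TCP connection to the URL's host and port succeeds.

    Hypothetical helper for illustration only; it does not talk to the Splash
    HTTP API, it merely checks that something accepts connections there.
    """
    parsed = urlparse(splash_url)
    host = parsed.hostname
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" (WinError 10061 on Windows) and timeouts.
        return False


# Example call, using the value of SPLASH_URL from settings.py:
# splash_reachable("http://127.0.0.1:8050")
```

If this returns False for the value in SPLASH_URL, the problem probably lies in how the container's port can be reached from the host rather than in the spider or the middleware settings.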

1 Answer:

Answer 0: (score: 0)

In settings.py, you should set:

SPLASH_URL = 'http://0.0.0.0:8050'

The Docker container's port should also be bound to the network interface the server listens on:

0.0.0.0:8050->8050/tcp
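
Which address actually reaches Splash depends on where the container runs and how its port is published. The shell prompt in the question points to Docker Toolbox, where, as far as I know, containers run inside a VirtualBox VM rather than on the Windows host itself. In that case the reachable address may be the one reported by docker-machine ip instead of the host loopback. A minimal sketch of a settings.py fragment that keeps the address configurable follows. It assumes an environment variable named SPLASH_URL and a helper resolve_splash_url, both hypothetical and not defined by scrapy-splash.

```python
# settings.py (fragment)
import os


def resolve_splash_url(environ=os.environ, default="http://127.0.0.1:8050"):
    """Pick the Splash endpoint from the environment, else use the default.

    The SPLASH_URL environment variable is an assumption made for this sketch.
    """
    # Strip a trailing slash so the resulting value has a consistent form.
    return environ.get("SPLASH_URL", default).rstrip("/")


# Scrapy reads this module-level name as the SPLASH_URL setting.
SPLASH_URL = resolve_splash_url()
```

With this in place, the endpoint can be changed per machine by setting the environment variable before running the crawl, without editing the project.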