How to fix "403" / "503" errors with Scrapy-Splash on Docker

Date: 2018-12-29 14:39:40

Tags: scrapy docker-compose splash scrapyd

When I run my Scrapy/Splash setup on Docker, Scrapy fails to connect to Splash with 403/503 errors.

This is my docker-compose file:

services:
  scraper:
    build:
      context: ./my_scraper
    image: "my_scraper"
    ports:
      - 6800:6800
    depends_on:
      - splash
  splash:
    image: "scrapinghub/splash"
    ports:
      - 8050:8050
      - 5023:5023
    expose:
      - 8050
      - 5023
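As a side note, one quick way to check whether the scraper container can actually reach the splash service over the Compose network is a small TCP probe run from inside the scraper container. This is only a diagnostic sketch; `splash` is the Compose service name, which Docker's embedded DNS should resolve on the shared network:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refused connections, and timeouts
        return False

# Inside the scraper container this should print True if the splash
# service is up and port 8050 is reachable:
print(can_connect("splash", 8050))
```

If this prints `False`, the problem is network-level (DNS or routing between the containers); if it prints `True`, Splash is reachable and the 403/503 is coming back in the HTTP response itself.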

The scraper's Dockerfile looks like this:

FROM python:3.7

WORKDIR /my_scraper

COPY requirements.txt /tmp/
RUN pip install -r /tmp/requirements.txt  

COPY . .

EXPOSE 6800

CMD ["scrapyd", "--pidfile="]

My Splash setting is:

SPLASH_URL = 'http://splash:8050'

The rest of the settings in the scraper are as suggested by the scrapy-splash documentation.
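For reference, the settings the scrapy-splash README recommends look roughly like this (the middleware classes and order numbers below are taken from those docs; the `SPLASH_URL` matches the Compose service name used above):

```python
# settings.py -- scrapy-splash configuration per the scrapy-splash README
SPLASH_URL = 'http://splash:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```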

However, when I run the scraper, it cannot connect to Splash and throws this error:

2018-12-29 13:24:14 [scrapy.core.engine] INFO: Spider opened
2018-12-29 13:24:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-29 13:24:14 [sorg] INFO: Spider opened: wiki_scraper
2018-12-29 13:24:14 [sorg] INFO: Spider opened: wiki_scraper
2018-12-29 13:24:14 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-12-29 13:24:14 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.wikipedia.org/ via http://splash:8050/render.html> (failed 1 times): 503 Service Unavailable
2018-12-29 13:24:14 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.wikipedia.org/ via http://splash:8050/render.html> (referer: None)
2018-12-29 13:24:14 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.wikipedia.org/>: HTTP status code is not handled or not allowed
2018-12-29 13:24:14 [scrapy.core.engine] INFO: Closing spider (finished)
2018-12-29 13:24:14 [scrapy.core.engine] ERROR: Scraper close failure

I had a similar problem when building the scraper image, but solved it with the EXPOSE instruction in the Dockerfile. For Splash, though, I'm using the off-the-shelf image, so I'm somewhat confused as to why I can't connect to it. Any help would be much appreciated.

EDIT: When I use the Docker IP '192.168.0.14' in the SPLASH_URL setting, I get a different error indicating (at least I think...) that it fails due to a timeout rather than a missing service:

2018-12-29 14:36:28 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.wikipedia.org/ via http://192.168.0.14:8050/render.html> (failed 2 times): 504 Gateway Time-out
2018-12-29 14:36:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-29 14:37:23 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.wikipedia.org/ via http://192.168.0.14:8050/render.html> (failed 3 times): User timeout caused connection failure: Getting http://192.168.0.14:8050/render.html took longer than 55.0 seconds..

0 Answers:

There are no answers.