I am trying to write a small web parser for pages that load their content with JavaScript. For that, I am trying ScrapyJS, which extends Scrapy with JavaScript support (via Splash).
I followed the installation instructions in the official repository. Scrapy by itself works fine, but the second ScrapyJS example (fetching the HTML content and a screenshot) does not. Hopefully my question will help others who run into the same problem ;)
My setup and code are as follows (in case they are needed):
sudo -H pip install scrapyjs
sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
Before that, I modified my Scrapy project's settings.py and added the following lines:
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'
The complete Python code looks like this:
import json
import base64

import scrapy


class MySpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse_result, meta={
                'splash': {
                    'args': {
                        'html': 1,
                        'png': 1,
                        'width': 600,
                        'render_all': 1,
                    }
                }
            })

    def parse_result(self, response):
        data = json.loads(response.body_as_unicode())
        body = data['html']
        png_bytes = base64.b64decode(data['png'])
        print body
I get the following error:
2016-01-07 14:08:16 [scrapy] INFO: Enabled item pipelines:
2016-01-07 14:08:16 [scrapy] INFO: Spider opened
2016-01-07 14:08:16 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-07 14:08:16 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-07 14:08:16 [scrapy] DEBUG: Retrying <POST http://127.0.0.1:8050/render.json> (failed 1 times): 400 Bad Request
2016-01-07 14:08:16 [scrapy] DEBUG: Retrying <POST http://127.0.0.1:8050/render.json> (failed 2 times): 400 Bad Request
2016-01-07 14:08:16 [scrapy] DEBUG: Gave up retrying <POST http://127.0.0.1:8050/render.json> (failed 3 times): 400 Bad Request
2016-01-07 14:08:16 [scrapy] DEBUG: Crawled (400) <POST http://127.0.0.1:8050/render.json> (referer: None)
2016-01-07 14:08:16 [scrapy] DEBUG: Ignoring response <400 http://127.0.0.1:8050/render.json>: HTTP status code is not handled or not allowed
2016-01-07 14:08:16 [scrapy] INFO: Closing spider (finished)
So I don't actually know where the error is; Scrapy on its own works fine. If I add SPLASH_URL = 'http://192.168.59.103:8050', I get a timeout error instead, and nothing happens at all. localhost:8050 does not work either. Leaving SPLASH_URL unset avoids the timeout, but then I get the error above.
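For reference, the variant with SPLASH_URL set explicitly looks like this in settings.py (the IP is my Docker host's address from the setup above; yours may differ):

```python
# settings.py variant with the Splash endpoint set explicitly
# (192.168.59.103 is my Docker host -- substitute your own address)
SPLASH_URL = 'http://192.168.59.103:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'
```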
Answer 0 (score: 1)
You need to pass a non-zero 'wait' value so that the full web page gets rendered.
So just add 'wait': 0.5 and it works:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, self.parse_result, meta={
            'splash': {
                'args': {
                    'html': 1,
                    'png': 1,
                    'width': 600,
                    'render_all': 1,
                    'wait': 0.5
                }
            }
        })
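Under the hood, the middleware turns these args (plus the request URL) into a JSON POST body for Splash's /render.json endpoint. A minimal sketch of roughly what that payload would contain, using the URL from the question (this is an illustration, not the middleware's exact serialization):

```python
import json

# Sketch of the JSON body that SplashMiddleware would roughly POST to
# http://<splash-host>:8050/render.json for the request above.
args = {
    'html': 1,
    'png': 1,
    'width': 600,
    'render_all': 1,
    'wait': 0.5,  # non-zero wait, as recommended above
}

payload = dict(args)
payload['url'] = 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
body = json.dumps(payload)
print(body)
```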
Answer 1 (score: 0)
Maybe you skipped this part?
Install Docker.
Pull the image:
$ sudo docker pull scrapinghub/splash