Question

We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a Docker container.

If we want to use Splash in a spider, we configure several required project settings and yield a Request, specifying specific meta arguments:

yield Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,

            # 'url' is prefilled from request url
        },

        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,  # scrapyjs is the package's old name
    }
})

This works as documented. But how can we use scrapy-splash inside the Scrapy Shell?

Answer 1

Just wrap the URL you want to open in the shell in the Splash HTTP API.

So you would want something like:

scrapy shell 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'

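where:
localhost:port is where your Splash service is running
url is the URL you want to crawl; don't forget to URL-quote it!
render.html is one of the possible HTTP API endpoints; in this case it returns the rendered HTML page
timeout is the timeout in seconds
wait is the time in seconds to wait for JavaScript to execute before reading/saving the HTML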
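The URL-quoting step matters because a target URL that carries its own query string would otherwise have its `&`-separated parameters read as Splash arguments. Here is a minimal sketch of building the wrapped URL with Python's standard library; the helper name `splash_render_url` and its defaults are my own choices for illustration, not part of Splash or Scrapy:

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_base="http://localhost:8050",
                      timeout=10, wait=0.5):
    """Build a render.html URL with the target URL percent-encoded.

    splash_base is assumed to be the address of a running Splash instance.
    """
    # urlencode percent-encodes each value, so ':', '/', '?', '=' and '&'
    # inside target_url cannot be mistaken for Splash's own parameters.
    query = urlencode({"url": target_url, "timeout": timeout, "wait": wait})
    return f"{splash_base}/render.html?{query}"


if __name__ == "__main__":
    # The extra query string on the target shows why quoting is needed.
    print(splash_render_url("http://domain.com/page-with-javascript.html?a=1&b=2"))
```

The printed string can then be passed, in quotes, to `scrapy shell`.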

Answer 2

You can run scrapy shell without arguments inside a configured Scrapy project, then create req = scrapy_splash.SplashRequest(url, ...) and call fetch(req).
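As a rough sketch of what that session might look like: the helper below wraps the `SplashRequest` construction, and the trailing comments show how it could be used at the shell prompt. The helper name `build_splash_request`, the `wait` value, and the example URL are assumptions for illustration. It also assumes the project's settings already enable the scrapy-splash middlewares and set `SPLASH_URL`.

```python
def build_splash_request(url, wait=0.5):
    """Return a SplashRequest asking Splash to render `url`.

    `wait` is passed through to Splash as a rendering argument.
    """
    # Imported lazily so the helper can be defined even where
    # scrapy-splash is not importable; it is needed only when called.
    from scrapy_splash import SplashRequest

    return SplashRequest(url, args={"wait": wait})


# At the prompt of `scrapy shell`, started from the project directory:
#   req = build_splash_request("http://domain.com/page-with-javascript.html")
#   fetch(req)   # `fetch` is a helper provided by the Scrapy shell
#   response.css("title::text").get()
```

Compared with wrapping the URL by hand, this route goes through the project's own middleware configuration, so the shell should see the same responses the spider would.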

Answer 3

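For Windows users who use Docker Toolbox:

Change the single quotes to double quotes to prevent the invalid hostname:http error.
Change localhost to the Docker IP address shown below the whale logo. For me it was 192.168.99.100.

Finally I got this: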

scrapy shell "http://192.168.99.100:8050/render.html?url="https://samplewebsite.com/category/banking-insurance-financial-services/""

Scrapy Shell and Scrapy Splash
