Question

I'm curious whether anyone can get Splash to render the dynamic job content from this page: https://nreca.csod.com/ux/ats/careersite/4/home?c=nreca#/requisition/182

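For Splash to receive the URL fragment, you have to use SplashRequest. To get it to process the JS cookies, I had to use a Lua script. Below are my environment, script, and Scrapy code.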

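The website seems to render in 3 "steps":

1. Basically empty HTML with script tags.
2. The script above runs, generates the site header/footer, and retrieves another script.
3. The script from #2 runs and, together with a cookie set by JavaScript, retrieves the dynamic content (the job I want to scrape).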

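If you do a simple GET on the URL (e.g. in Postman), you only see the step 1 content. With Splash I only get the result of step 2 (header/footer). I do see the JS cookies in response.cookiejar.

I can't get the dynamic job content (step 3) to render.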

Environment:

Scrapy 1.3.3, scrapy-splash 0.72 settings

    script = """
        function main(splash)
          splash:init_cookies(splash.args.cookies)
          assert(splash:go{
            splash.args.url,
            headers=splash.args.headers,
            http_method=splash.args.http_method,
            body=splash.args.body,
            })
          assert(splash:wait(15))

          local entries = splash:history()
          local last_response = entries[#entries].response
          return {
            url = splash:url(),
            headers = last_response.headers,
            http_status = last_response.status,
            cookies = splash:get_cookies(),
            html = splash:html(),
          }
        end
    """

    return SplashRequest('https://nreca.csod.com/ux/ats/careersite/4/home?c=nreca#/requisition/182', 
        self.parse_detail, 
        endpoint='execute',
        cache_args=['lua_source'],
        args={
            'lua_source': script,
            'wait': 10,
            'headers': {'User-Agent': 'Mozilla/5.0'}
        },
    )
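
One way to check which of the three steps actually ran is to have the Lua script also return `splash:har()` and look at the requests Splash made. The sketch below is an illustration under assumptions, not part of the original code: `summarize_har` is a hypothetical helper, the sample data uses made-up URLs, and it assumes the script's return table includes `har = splash:har()` so that scrapy-splash exposes it as `response.data['har']`.

```python
def summarize_har(har):
    """Return (status, url) pairs for every request recorded in a HAR dict.

    Hypothetical helper: assumes the standard HAR layout, i.e.
    har["log"]["entries"][i]["request"]["url"] and ["response"]["status"].
    """
    entries = har.get("log", {}).get("entries", [])
    return [
        (entry.get("response", {}).get("status"), entry["request"]["url"])
        for entry in entries
    ]


# Tiny hand-written HAR fragment (made-up URLs) to show the output shape.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/"},
             "response": {"status": 200}},
            {"request": {"url": "https://example.com/app.js"},
             "response": {"status": 200}},
        ]
    }
}

summary = summarize_har(sample_har)
for status, url in summary:
    print(status, url)
```

In a spider callback this could be called as `summarize_har(response.data['har'])`. If the request that fetches the job data never appears in the list, the page's JavaScript probably stopped before step 3.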

Answer 1

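This is a problem with Splash running in private browsing mode by default (specifically, access to window.localStorage is not allowed). This usually causes JavaScript exceptions to be thrown. Try starting Splash with the --disable-private-mode option, or refer to this documentation entry: http://splash.readthedocs.io/en/stable/faq.html#disable-private-mode.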
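
As a sketch of how this could be applied to the request in the question: besides starting Splash with `--disable-private-mode`, Splash's scripting API has a `splash.private_mode_enabled` attribute that can be set to `false` inside the Lua script (check that your Splash version supports it). The helper name `build_splash_kwargs` is hypothetical, and the Lua script is a trimmed variant of the one in the question, not the asker's exact code.

```python
# Lua script based on the one in the question, with private mode turned off
# before the page loads so that window.localStorage is available to the page.
LUA_SCRIPT = """
function main(splash)
  splash.private_mode_enabled = false
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{splash.args.url, headers=splash.args.headers})
  assert(splash:wait(splash.args.wait))
  return {
    url = splash:url(),
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""


def build_splash_kwargs(wait=15):
    """Build keyword arguments for scrapy_splash.SplashRequest.

    Hypothetical helper, used only to keep this example free of a
    scrapy-splash import.
    """
    return {
        "endpoint": "execute",
        "cache_args": ["lua_source"],
        "args": {
            "lua_source": LUA_SCRIPT,
            "wait": wait,  # read inside the script as splash.args.wait
            "headers": {"User-Agent": "Mozilla/5.0"},
        },
    }


kwargs = build_splash_kwargs()
print(sorted(kwargs["args"]))
```

In the spider, the result would be passed along as `SplashRequest(url, self.parse_detail, **build_splash_kwargs())`. Passing the wait time through `args` keeps the Lua script and the request in agreement, instead of hardcoding one value in the script and sending another.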

Scrapy-splash not rendering dynamic content from a certain React-driven site
