Question

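I'm trying to scrape some dynamic websites in Python using Splash for Scrapy. However, I've found that in some cases Splash doesn't wait for the whole page to load. A brute-force fix is to add a large wait time (for example, 5 seconds in the snippet below). But this is very inefficient, and it still fails to load some data (sometimes the content takes longer than 5 seconds to load). Is there some kind of wait-for-element condition that can be used with these requests?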

yield SplashRequest(
          url, 
          self.parse, 
          args={'wait': 5},
          headers={'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36"},
)

Answer 1

Yes, you can write a Lua script to do this. Something like:

function main(splash)
  splash:set_user_agent(splash.args.ua)
  assert(splash:go(splash.args.url))

  -- requires Splash 2.3  
  while not splash:select('.my-element') do
    splash:wait(0.1)
  end
  return {html=splash:html()}
end

Before Splash 2.3, you can use splash:evaljs('!document.querySelector(".my-element")') instead of not splash:select('.my-element').

Save this script to a variable (lua_script = """ ... """). Then you can send a request like this:

yield SplashRequest(
    url, 
    self.parse, 
    endpoint='execute',
    args={
        'lua_source': lua_script,
        'ua': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36"
    }
)

For details on how to write Splash Lua scripts, see the scripting tutorial and reference.

Answer 2

I had a similar requirement, but with a timeout. My solution is a slight modification of the above:

function wait_css(splash, css, maxwait)
    if maxwait == nil then
        maxwait = 10     --default maxwait if not given
    end

    local i=0
    while not splash:select(css) do
       if i>=maxwait then
           break     --times out at maxwait secs
       end
       i=i+1
       splash:wait(1)      --each loop has duration 1sec
    end
end
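
To call wait_css from Scrapy, the helper has to sit in the same Lua source as main, and the selector and timeout can be sent as extra Splash arguments. Here is a minimal sketch of that wiring. The helper name build_wait_args and the argument names css and maxwait are assumptions made for this example. They are not part of Splash or scrapy-splash.

```python
# Lua source combining the wait_css helper with a main() entry point.
# The argument names "css" and "maxwait" are chosen for this sketch only;
# Splash exposes the values sent in the request args under splash.args.
LUA_WAIT_SCRIPT = """
function wait_css(splash, css, maxwait)
    if maxwait == nil then
        maxwait = 10
    end
    local i = 0
    while not splash:select(css) do
        if i >= maxwait then
            break
        end
        i = i + 1
        splash:wait(1)
    end
end

function main(splash)
    splash:set_user_agent(splash.args.ua)
    assert(splash:go(splash.args.url))
    wait_css(splash, splash.args.css, splash.args.maxwait)
    return {html=splash:html()}
end
"""


def build_wait_args(css, user_agent, maxwait=10):
    """Build the args dict for a SplashRequest sent to the 'execute' endpoint.

    Hypothetical helper for illustration; it only assembles a plain dict.
    """
    return {
        "lua_source": LUA_WAIT_SCRIPT,
        "ua": user_agent,
        "css": css,
        "maxwait": maxwait,
    }


# Inside a spider callback (requires scrapy-splash to be installed):
# yield SplashRequest(url, self.parse, endpoint="execute",
#                     args=build_wait_args(".my-element", my_user_agent))
```

It is probably wise to keep maxwait below Splash's own timeout argument (30 seconds by default), since Splash would otherwise abort the script before the loop gives up.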

Answer 3

You can combine a Lua script with JavaScript and splash:wait_for_resume (documentation).

function main(splash, args)
  splash.resource_timeout = 60
  assert(splash:go(splash.args.url))
  assert(splash:wait(1))
  splash.scroll_position = {y=500}
  -- pause Lua until the page's JavaScript calls splash.resume(), or 30 seconds pass
  local result, error = splash:wait_for_resume([[
    function main(splash) {
      var checkExist = setInterval(function() {
        var el = document.querySelector(".css-selector");
        // resume only once the element exists and has text
        if (el && el.innerText) {
          clearInterval(checkExist);
          splash.resume();
        }
      }, 1000);
    }
  ]], 30)
  assert(splash:wait(0.5))
  return splash:html()
end

If you are not using the scrapy-splash plugin, pay attention to splash.args.url in the splash:go call; it will be different.
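
Without scrapy-splash, a Lua script can be posted directly to Splash's execute HTTP endpoint, with the page URL sent as the url argument. Here is a minimal sketch using only the standard library. It assumes Splash is listening on http://localhost:8050. The helper name build_execute_request and the short placeholder script are made up for this example.

```python
import json
import urllib.request

# Placeholder script for this sketch; any script whose main() reads
# splash.args.url can be sent the same way.
LUA_SCRIPT = """
function main(splash)
    assert(splash:go(splash.args.url))
    assert(splash:wait(1))
    return splash:html()
end
"""


def build_execute_request(page_url, lua_source=LUA_SCRIPT,
                          splash_url="http://localhost:8050", timeout=60):
    """Build (but do not send) a POST request for Splash's /execute endpoint.

    Hypothetical helper; splash_url is an assumed local Splash address.
    """
    payload = {
        "lua_source": lua_source,
        "url": page_url,     # available as splash.args.url inside the script
        "timeout": timeout,  # overall Splash timeout for the script, in seconds
    }
    return urllib.request.Request(
        splash_url.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending the request needs a running Splash instance:
# with urllib.request.urlopen(build_execute_request("https://example.com")) as resp:
#     html = resp.read().decode("utf-8")
```

The timeout value is sent explicitly because a script that waits up to 30 seconds inside splash:wait_for_resume could otherwise run into Splash's default 30-second limit.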

Adding a wait for an element when making a SplashRequest in Python Scrapy
