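I want to reverse engineer the content that is generated when you scroll down a web page. The problem is the URL https://www.crowdfunder.com/user/following_page/80159?user_id=80159&limit=0&per_page=20&screwrand=933: the screwrand parameter doesn't seem to follow any pattern, so reverse engineering the URLs doesn't work. I'm considering rendering the page automatically with Splash instead. How can I make Splash scroll the page the way a browser does? Thanks a lot!
Here is the code for the two requests: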
request1 = scrapy_splash.SplashRequest(
    'https://www.crowdfunder.com/user/following/{}'.format(user_id),
    self.parse_follow_relationship,
    args={'wait': 2},
    meta={'user_id': user_id, 'action': 'following'},
    endpoint='http://192.168.99.100:8050/render.html')
yield request1
request2 = scrapy_splash.SplashRequest(
    'https://www.crowdfunder.com/user/following_user/80159?user_id=80159&limit=0&per_page=20&screwrand=76',
    self.parse_tmp,
    meta={'user_id': user_id, 'action': 'following'},
    endpoint='http://192.168.99.100:8050/render.html')
yield request2
Answer 0 (score: 9)
To scroll the page you can write a custom rendering script (see http://splash.readthedocs.io/en/stable/scripting-tutorial.html), something like this:
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1.0
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    return splash:html()
end
To run this script, use the 'execute' endpoint instead of the render.html endpoint:
script = """<Lua script> """
scrapy_splash.SplashRequest(url, self.parse,
                            endpoint='execute',
                            args={'wait': 2, 'lua_source': script}, ...)
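If you want to check the script before wiring it into a spider, you can also send it straight to Splash's HTTP API. The following is only a minimal sketch using the standard library: the helper name build_execute_request is made up for illustration, the Splash address is the one from the question, and the short Lua script in the usage part is a stand-in for the scrolling script above. The fields of the JSON body are what the Lua side reads as splash.args.url and splash.args.wait.

```python
import json
import urllib.request


def build_execute_request(splash_base, page_url, lua_source, wait=2.0):
    """Build (but do not send) a POST request for Splash's `execute` endpoint.

    Hypothetical helper for illustration; it is not part of scrapy_splash.
    """
    # Each key of this JSON object is exposed to the Lua script as
    # splash.args.<key>, so "url" and "wait" match what the script reads.
    payload = {"lua_source": lua_source, "url": page_url, "wait": wait}
    return urllib.request.Request(
        splash_base.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Stand-in script; paste the scrolling script here instead.
    script = (
        "function main(splash) "
        "assert(splash:go(splash.args.url)) "
        "return splash:html() end"
    )
    req = build_execute_request(
        "http://192.168.99.100:8050",  # assumed Splash address (from the question)
        "https://www.crowdfunder.com/user/following/80159",
        script,
    )
    print(req.get_method(), req.full_url)
    # Sending it needs a running Splash instance:
    # html = urllib.request.urlopen(req, timeout=60).read().decode("utf-8")
```

Once the HTML that comes back contains the extra items, the same script and arguments should carry over to the SplashRequest call shown above.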
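Answer 1 (score: 0)
Thanks Mikhail, I tried your scroll script and it works. However, I noticed that it scrolls too far in a single step, so some of the JavaScript doesn't get enough time to render and is skipped. I made a few changes, as follows: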
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)
    for _ = 1, num_scrolls do
        local height = get_body_height()
        for i = 1, 10 do
            scroll_to(0, height * i/10)
            splash:wait(scroll_delay/10)
        end
    end
    return splash:html()
end
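To make the arithmetic of the inner loop easier to see, here is a small Python sketch that lists the offsets and pauses for one pass. The function name scroll_plan and the 5000 px page height are made up for illustration; the script itself measures the real height with document.body.scrollHeight at the start of each pass.

```python
def scroll_plan(page_height, steps=10, scroll_delay=1.0):
    """Return (y_offset, pause_seconds) pairs for one pass of stepped scrolling.

    Mirrors the inner loop of the Lua script: the page is scrolled to
    height * i / steps and each step is followed by scroll_delay / steps.
    """
    pause = scroll_delay / steps
    return [(page_height * i / steps, pause) for i in range(1, steps + 1)]


if __name__ == "__main__":
    # 5000 is an assumed page height, used only for this illustration.
    for y, pause in scroll_plan(5000):
        print(f"scroll to y={y:.0f}, then wait {pause:.1f}s")
```

The total wait per pass is still scroll_delay, the same as in the original script; the difference is that the page goes through intermediate positions on its way to the bottom.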