Question

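I'm new to Scrapy-splash and I'm trying to scrape a lazy datatable, i.e. a table with AJAX pagination.

So I need to load the website, wait until the JS has executed, get the table's HTML, and then click the "Next" button in the pagination.

My approach works, but I'm afraid I'm requesting the website twice.

Once when the SplashRequest is yielded, and again when the lua_script is executed.

Is that true? If so, how can I make it perform the request only once?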

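import scrapy
from lxml import etree
from lxml.etree import HTMLParser
from scrapy_splash import SplashRequest


class JSSpider(scrapy.Spider):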
    name = 'js_spider'
    script = """
    function main(splash, args)
        splash:go(args.url)
        splash:wait(0.5)
        local page_one = splash:evaljs("$('#example').html()")
        splash:evaljs("$('#example_next').click()")
        splash:wait(2)
        local page_two = splash:evaljs("$('#example').html()")
        return {page_one=page_one,page_two=page_two}
    end"""

    def start_requests(self):
        url = "https://datatables.net/examples/server_side/defer_loading.html"
        yield SplashRequest(url, endpoint='execute', callback=self.parse,
                            args={'wait': 0.5, 'lua_source': self.script, 'url': url})

    def parse(self, response):
        # assert isinstance(response, SplashTextResponse)
        page_one = response.data.get('page_one',None)
        page_one_root = etree.fromstring(page_one, HTMLParser())
        page_two = response.data.get('page_two',None)
        page_two_root = etree.fromstring(page_two, HTMLParser())

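Edit

I'd also like to wait for the AJAX call in a better way than just splash:wait(2). Is it possible to somehow wait until the table changes? Ideally with a timeout.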
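
One possible approach, given as a sketch under assumptions rather than a tested solution: instead of the fixed splash:wait(2), poll the table's HTML from Lua until it differs from the first page or a deadline passes. The names WAIT_FOR_CHANGE_SCRIPT, build_splash_args, change_timeout and poll_interval are hypothetical and not part of scrapy-splash. The script only uses splash:go, splash:wait, splash:evaljs and splash:runjs, and relies on jQuery being present on the page, as in the original script.

```python
# Lua source for Splash's `execute` endpoint. It replaces the fixed
# splash:wait(2) with a polling loop that stops when the table HTML changes
# or when args.change_timeout seconds have been spent waiting.
WAIT_FOR_CHANGE_SCRIPT = """
function main(splash, args)
    assert(splash:go(args.url))
    splash:wait(0.5)
    local page_one = splash:evaljs("$('#example').html()")
    splash:runjs("$('#example_next').click()")

    -- Poll until the table differs from page one or the timeout expires.
    local page_two = nil
    local waited = 0
    while waited < args.change_timeout do
        splash:wait(args.poll_interval)
        waited = waited + args.poll_interval
        local html = splash:evaljs("$('#example').html()")
        if html ~= page_one then
            page_two = html
            break
        end
    end
    return {page_one = page_one, page_two = page_two, timed_out = (page_two == nil)}
end
"""


def build_splash_args(url, change_timeout=5.0, poll_interval=0.1):
    """Build the args dict for SplashRequest(..., endpoint='execute').

    Hypothetical helper: change_timeout and poll_interval are custom keys
    that only the Lua script above reads.
    """
    return {
        "lua_source": WAIT_FOR_CHANGE_SCRIPT,
        "url": url,
        "change_timeout": change_timeout,
        "poll_interval": poll_interval,
    }
```

In the spider this would be used as `yield SplashRequest(url, endpoint='execute', callback=self.parse, args=build_splash_args(url))`, and `parse` can check `response.data.get('timed_out')` before reading `page_two`. One caveat: any change to the table counts, including an intermediate "processing" state, so a stricter condition may be needed. The timeout should also stay below Splash's own script timeout.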

Answer 1

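Lua scripts are very literal: if you have one splash:go, then Splash makes one request.
Your crawler is fine here.

However, as a pointless nitpick: your spider connects to a Splash worker over HTTP, so in theory two requests are made: the first from the spider to the Splash service, the second from the worker to the target site.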
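
To make the two hops concrete, here is a rough sketch of the first one, i.e. roughly the kind of HTTP request that goes from the spider to the Splash service. It assumes Splash is listening at http://localhost:8050 (an assumption; use whatever your SPLASH_URL setting points to) and only builds the request without sending it. build_execute_request is a hypothetical helper, not a scrapy-splash function. The second hop is the single GET that splash:go issues to the target site.

```python
import json
from urllib.request import Request

SPLASH_URL = "http://localhost:8050"  # assumption: a locally running Splash


def build_execute_request(lua_source, target_url, splash_url=SPLASH_URL):
    """Build (but do not send) a POST to Splash's `execute` endpoint."""
    payload = {"lua_source": lua_source, "url": target_url}
    return Request(
        url=f"{splash_url}/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


script = "function main(splash, args) splash:go(args.url) return splash:html() end"
req = build_execute_request(
    script, "https://datatables.net/examples/server_side/defer_loading.html"
)
# Hop 1: spider -> Splash service. The target URL is only data in the body.
print(req.get_method(), req.full_url)  # POST http://localhost:8050/execute
```

The target site's URL appears only inside the JSON body, so nothing contacts the target until the Lua script calls splash:go.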
