Question

I am trying to use scrapy-splash on the following page:

https://www2.deloitte.com/us/en/misc/search.html#country=All#qr=accounting

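The content is created dynamically with JavaScript. To allow for any delay in building the page, I use a wait time of 10 seconds.

One of my several attempts looks like this: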

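import base64

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        url = "https://www2.deloitte.com/us/en/misc/search.html#country=All#qr=accounting"
        splash_args = {
            'html': 1,
            'png': 1,
            'iframes': 1,
            'wait': 10,  # seconds to wait after load for JavaScript rendering
        }
        yield SplashRequest(url=url, callback=self.parse_result,
                            endpoint='render.json', args=splash_args)

    def parse_result(self, response):
        # render.json returns the screenshot as a base64-encoded string
        png_bytes = base64.b64decode(response.data['png'])
        # the with block closes the file, so no explicit close() is needed
        with open('s1.png', 'wb') as f:
            f.write(png_bytes)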

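I have tried other variants, including Lua scripts with a wait-for-element function and other endpoints (I chose render.json because I was not sure whether some iframe issue might be involved). Nothing gets me to the results page; all I get is a page showing the loading spinner.

When I paste the URL into a browser, the results take about one second to load. I don't know where to look for a solution.

Finally, what throws me off completely is that the same script works with the country refinement removed, i.e.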
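A wait-for-element variant of the kind mentioned above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the exact script that was tried: the CSS selector `.results` is hypothetical (the real selector of the results container is not known here), `build_execute_payload` and `render` are illustrative helper names, and Splash is assumed to listen on `http://localhost:8050`. The Lua part polls `splash:select` until the element appears or the time budget runs out, and the Python part only builds and posts the JSON body for Splash's `execute` endpoint.

```python
import json
import urllib.request

# Lua script for Splash's `execute` endpoint: poll until an element
# matching args.css appears, or until args.max_wait seconds have passed.
LUA_WAIT_FOR_ELEMENT = """
function main(splash, args)
  assert(splash:go(args.url))
  local waited = 0
  while not splash:select(args.css) and waited < args.max_wait do
    splash:wait(0.2)
    waited = waited + 0.2
  end
  return {html = splash:html(), png = splash:png()}
end
"""


def build_execute_payload(url, css, max_wait=10):
    """Build the JSON body for a POST to Splash's /execute endpoint."""
    return {
        "lua_source": LUA_WAIT_FOR_ELEMENT,
        "url": url,
        "css": css,            # hypothetical selector, adjust to the real page
        "max_wait": max_wait,  # upper bound in seconds for the polling loop
    }


def render(splash_base, payload):
    """POST the payload to Splash and return the decoded JSON result."""
    request = urllib.request.Request(
        splash_base.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_execute_payload(
    "https://www2.deloitte.com/us/en/misc/search.html#country=All#qr=accounting",
    css=".results",  # assumption, not taken from the page
)
# result = render("http://localhost:8050", payload)  # needs a running Splash
```

With scrapy-splash, the same Lua source would be passed as `args={'lua_source': LUA_WAIT_FOR_ELEMENT, 'css': ..., 'max_wait': ...}` together with `endpoint='execute'`.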

https://www2.deloitte.com/us/en/misc/search.html#qr=accounting

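everything works as expected. However, even with this version without the country refinement, when I try to load the next page by yielding another SplashRequest with the updated URL (including #p=2), the same problem occurs again: I just get a page with empty content.

The system is the latest Splash, 3.1.1, running from Docker on Windows 10.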
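The working and failing URLs differ only in the part after the first `#`. The short sketch below, using only the standard library, splits the two URLs quoted above into base and fragment. Fragments are not sent to the server in an HTTP request, so both URLs fetch the same document and the search state has to be applied by the page's own JavaScript. This does not by itself explain the failure; it only narrows down where the difference lies. `split_fragment` is an illustrative helper name.

```python
from urllib.parse import urlsplit

URLS = [
    "https://www2.deloitte.com/us/en/misc/search.html#country=All#qr=accounting",
    "https://www2.deloitte.com/us/en/misc/search.html#qr=accounting",
]


def split_fragment(url):
    """Return (url_without_fragment, fragment) for the given URL."""
    parts = urlsplit(url)
    base = parts._replace(fragment="").geturl()
    return base, parts.fragment


for url in URLS:
    base, fragment = split_fragment(url)
    # Everything after the first '#' is the fragment, including later '#'s.
    print(base, "|", fragment)
```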

Scrapy-Splash: unable to load dynamically created content

0 answers: