Question

I am trying to use Splash and the scrapy-splash Python module to produce on-the-fly reports for some of my web pages.

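The problem is that when a site redirects, I cannot get the correct final URL the way I can by calling render.json directly. For example, rendering www.google.com through localhost:8050/render.json gives: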

{"requestedUrl": "http://www.google.com/", 
"url": "https://www.google.com/?gws_rd=ssl", 
"title": "Google", "geometry": [0, 0, 1024, 768]}

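To make the expectation concrete, here is a minimal sketch in plain Python of how the two fields in that payload relate. It only parses the JSON shown above; `was_redirected` is a hypothetical helper name used for illustration, not part of Splash or scrapy-splash.

```python
import json

# Payload copied from the render.json example above.
payload = json.loads(
    '{"requestedUrl": "http://www.google.com/", '
    '"url": "https://www.google.com/?gws_rd=ssl", '
    '"title": "Google", "geometry": [0, 0, 1024, 768]}'
)


def was_redirected(data):
    """Return True when the final URL differs from the requested one."""
    # Hypothetical helper for illustration; not a Splash API.
    return data["requestedUrl"] != data["url"]


final_url = payload["url"]             # URL after the redirect
redirected = was_redirected(payload)   # compares the two fields
```

This is the pair of values I would expect to see in `response.data` inside the spider callback as well.
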
But in my Python script, I can only get "http://www.google.com".

My code is:

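    from scrapy import Request
    from scrapy_splash import SplashRequest

    # The methods below belong to the spider class (class definition omitted).
    def start_requests(self):
        return [Request(self.url, callback=self.parse, dont_filter=True)]

    def parse(self, response):
        # Ask Splash to wait 1 second before returning the rendered page.
        splash_args = {'wait': 1}
        return SplashRequest(
            response.url,
            self.parse_link,
            args=splash_args,
            endpoint='render.json',
        )

    def parse_link(self, response):
        # Collect every URL-like value available on the Splash response.
        result = {
            'requested_url': response.data['requestedUrl'],
            'real_url': response.data['url'],
            'response': response.request.url,
            'splash_url': response.real_url,
        }
        return result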

But none of these gives the redirected URL; the result is:

{"requested_url": "http://www.google.com/", 
 "real_url": "http://www.google.com/", 
 "response": "http://127.0.0.1:8050/render.json", 
 "splash_url": "http://127.0.0.1:8050/render.json"}

Is there any way to overcome this?

Scrapy-Splash: requested URL vs. real URL

0 answers: