Question

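I am trying to use Scrapy-Splash with the 'render.png' endpoint to take a screenshot of a website. (In practice I would do this in my spider after certain exceptions occur, so that I can see what the website looked like at that point.)

The problem I have is that the response does not appear to be a valid PNG. A minimal example in the scrapy shell is: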

from scrapy_splash import SplashRequest

url = 'http://www.waitrose.com'
args = {'wait': 2, 'width': 320, 'timeout': 60, 'render_all': 1}
endpoint = 'render.png'

# I also tried with dont_send_headers=True, dont_process_response=True
sr = SplashRequest(url=url, args=args, endpoint=endpoint)

fetch(sr)

Of course, you need a local Splash server running to do this (see the Splash installation docs).

The response headers are

{'Content-Type': 'image/png',
 'Date': 'Mon, 10 Apr 2017 21:23:48 GMT',
 'Server': 'TwistedWeb/16.1.1'}

but the body starts like

In [16]: response.body[:100]
Out[16]: '<html><head></head><body>\xe2\x80\xb0PNG\n\x1a\n\nIHDR\x01@\x04\xc2\xad\x08\x065r\xe2\x80\x9aQ\tpHYs\x0fa\x0fa\x01\xc2\xa8?\xc2\xa7i IDATx\x01\xc3\xac\xc2\xbd\x07\xc5\x93\\\xc3\x97u\xc3\xa6y\xc2\xaa\xc2\xbab\xc3\xa7\xc5\x93\xc3\x91'

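Even after trimming the html tags and saving the result to a file, my system reports an invalid PNG.

On the other hand, if I use the python-requests module, as in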

import requests                                                                     
base_url = "http://localhost:8050/render.png"
params = {'url': 'http://www.waitrosecellar.com',
          'wait': 2,
          'width': 320,
          'timeout': 60,
          'render_all': 1}
response2 = requests.get(base_url, params)

I have no problems. The response content starts like

In [19]: response2.content[:100]
Out[19]: '\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01@\x00\x00\x03)\x08\x06\x00\x00\x00u\xf4\xea\x11\x00\x00\x00\tpHYs\x00\x00\x0fa\x00\x00\x0fa\x01\xa8?\xa7i\x00\x00 \x00IDATx\x01\xec\xbd\x07\x9c]\xc7u\xdf\x7f\xb6\x17\xec\xa2\xf7\xba(\x04A\x80`\x17\x8bH\x90\x14\x9bHY\xdd\x92l\xc9\x92\xab\\\x92'

The headers are

In [20]: response2.headers
Out[20]: {'Transfer-Encoding': 'chunked', 'Date': 'Mon, 10 Apr 2017 21:39:17 GMT', 'Content-Type': 'image/png', 'Server': 'TwistedWeb/16.1.1'}

and saving the content to a file yields a valid PNG image that I can view on my system.
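The difference between the two bodies can also be checked programmatically. A PNG file always begins with the same 8-byte signature, so a small helper is enough to tell a usable screenshot from a mangled one. This is only a sketch: the name looks_like_png is made up for illustration, and the two sample prefixes are shortened copies of the outputs shown above.

```python
# The fixed 8-byte signature that every PNG file starts with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(body: bytes) -> bool:
    """Return True if the body starts with the PNG signature."""
    return body.startswith(PNG_SIGNATURE)


# Shortened prefixes of the two bodies shown in the question.
bad = b"<html><head></head><body>\xe2\x80\xb0PNG\n\x1a\n\nIHDR"
good = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"

print(looks_like_png(bad))   # False
print(looks_like_png(good))  # True
```

Running the same check on response.body at different points in the middleware chain is one way to narrow down where the bytes stop being a valid PNG.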

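What is SplashRequest doing that messes up the PNG?

I see exactly the same problem when using the screenshot pipeline from the scrapy docs.

EDIT: Interestingly, if I set a breakpoint in the middleware's process_response, response.body is a valid PNG at that stage.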

Answer 1

It turned out that the culprit was a BeautifulSoup HTML-parser middleware I had in the chain, whose 'process_response' method was mangling the PNG bytes.
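The corrupted body in the question is consistent with binary data being treated as text: the first PNG byte \x89 decoded as Windows-1252 is the character U+2030, which re-encodes to \xe2\x80\xb0 in UTF-8. That the parser took exactly this path is an assumption, not something confirmed above. The sketch below reproduces the effect and shows one way to guard against it by leaving non-HTML responses alone. The class HtmlOnlyRewriteMiddleware and its rewrite_html hook are hypothetical names standing in for the real BeautifulSoup middleware.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def mangle_like_text_roundtrip(raw: bytes) -> bytes:
    """Decode bytes as Windows-1252 and re-encode as UTF-8 (assumed failure mode)."""
    return raw.decode("cp1252", errors="replace").encode("utf-8")


class HtmlOnlyRewriteMiddleware:
    """Hypothetical downloader middleware that only rewrites HTML responses."""

    def process_response(self, request, response, spider):
        # Scrapy returns header values as bytes; accept str too for safety.
        content_type = response.headers.get("Content-Type") or b""
        if isinstance(content_type, str):
            content_type = content_type.encode("latin-1")
        if not content_type.lower().startswith(b"text/html"):
            # Binary bodies such as image/png pass through untouched.
            return response
        return self.rewrite_html(response)

    def rewrite_html(self, response):
        # Placeholder for the BeautifulSoup-based processing of HTML bodies.
        return response


# The mangled signature starts with the same bytes seen in the question.
print(mangle_like_text_roundtrip(PNG_SIGNATURE)[:6])
```

With a guard like this in place, a render.png response with Content-Type image/png would reach the spider with its bytes unchanged, while HTML pages would still go through the parser.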
