Question

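I am trying to use scrapy to crawl a website.

I found that requests to the same URL can return different responses. There appear to be two versions of the site. I used the same user agent for every request.

Is there a way to make the responses consistent? Or is my only option to detect which version each response is and then use a different XPath to extract the items?

The response.headers from scrapy shell look like this: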



{'Cache-Control': 'max-age=0, private, must-revalidate',
 'Content-Type': 'text/html; charset=utf-8',
 'Date': 'Fri, 04 Dec 2015 18:56:59 GMT',
 'Server': 'nginx/1.6.2',
 'Set-Cookie': 'auth_token=hello; domain=www.medhelp.org; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT',
 'X-Rack-Cache': 'miss',
 'X-Request-Id': '70f23a01ac124fd58acc9e9e7bafb609',
 'X-Runtime': '0.150452',
 'X-Ua-Compatible': 'IE=8'}


Answer 1

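This depends entirely on the website, not on scrapy. One thing that may help in this case is to inspect response.headers, in particular the Last-Modified header, which should be returned and carries the date the page was last modified.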
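As a rough sketch of how a spider callback might cope with this, the helpers below read Last-Modified when the server sends it and try one XPath per page version until one matches. Note that the headers shown in the question do not include Last-Modified, in which case the first helper simply returns None. The XPath expressions and the names TITLE_XPATHS, last_modified and extract_title are hypothetical placeholders, not taken from the site or from scrapy; they only assume that the response object offers headers.get() and xpath(...).get(), as scrapy responses do.

```python
# Hypothetical XPath expressions, one per page version.
# Replace them with the real expressions for each layout of the target site.
TITLE_XPATHS = [
    "//h1[@class='subject_title']/text()",  # assumed layout A
    "//div[@id='post_title']/text()",       # assumed layout B
]


def last_modified(response):
    """Return the Last-Modified header as text, or None if the server omits it."""
    raw = response.headers.get("Last-Modified")
    if raw is None:
        return None
    # Scrapy returns header values as bytes; decode them for logging or comparison.
    return raw.decode("latin-1") if isinstance(raw, bytes) else raw


def extract_title(response):
    """Try each layout's XPath in turn.

    Returns (layout_index, text) for the first match, or (None, None)
    when no known layout matches.
    """
    for index, query in enumerate(TITLE_XPATHS):
        text = response.xpath(query).get()
        if text:
            return index, text.strip()
    return None, None
```

Inside a parse callback, calling extract_title(response) and logging the returned layout index next to last_modified(response) would show which version each request received. This does not make the server consistent; it only lets the spider handle both versions.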

How to prevent the same URL from returning different responses?
