Question

This is the code I ran in IPython.

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

response = HtmlResponse(url='https://en.wikipedia.org/wiki/Pan_American_Games')
datas = Selector(response=response).xpath('//div[@class="thumb tleft"]')

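When I evaluate response, I get <200 https://en.wikipedia.org/wiki/Pan_American_Games>, but when I evaluate response.body, I get '' (an empty string).

It seems that HtmlResponse does not retrieve any of the HTML for this page.

Does anyone know how to solve this?

FYI, if I run $ scrapy shell https://en.wikipedia.org/wiki/Pan_American_Games at the command prompt, the response is not empty. I don't want to use the scrapy shell url approach because I will be running this in a loop over a list of URLs.

Thanks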

Answer 1

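The problem is that you are not writing a spider here. HtmlResponse does not retrieve any data from the internet. All you have is a response object that carries only the url attribute you supplied.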

Here is the official description of the Scrapy architecture: http://doc.scrapy.org/en/latest/topics/architecture.html?highlight=scrapy%20architecture

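However, if you really do want to use Scrapy features such as selectors without a Scrapy spider, you can retrieve the page with requests and then carry on using Scrapy selectors, item loaders and so on. This approach is not recommended, though, because you miss out on everything else Scrapy provides.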

The official Scrapy tutorial for beginners: http://doc.scrapy.org/en/latest/intro/tutorial.html

Answer 2

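Are you sure you want to use Scrapy for this? If you do, you should really follow the tutorial and use a spider. I'm fairly sure this is not how Scrapy is meant to be used.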

If you just want a basic scraper in Python 2, I suggest the following:

from urllib2 import urlopen
from lxml import html

response = urlopen('https://en.wikipedia.org/wiki/Pan_American_Games')
page = html.fromstring(response.read())
datas = page.xpath('//div[@class="thumb tleft"]')
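The snippet above relies on urllib2, which exists only in Python 2. A possible Python 3 equivalent is sketched here, assuming lxml is installed. The helper names parse_thumbs and fetch_page are made up for illustration, and the User-Agent string and timeout are assumptions, added because some servers reject urllib's default user agent.

```python
from urllib.request import Request, urlopen

from lxml import html


def parse_thumbs(raw_html):
    """Parse raw HTML (bytes or str) and return the matching div elements."""
    page = html.fromstring(raw_html)
    return page.xpath('//div[@class="thumb tleft"]')


def fetch_page(url):
    """Download a page and return its thumbnail divs."""
    # The User-Agent value is a hypothetical placeholder, not a required string.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=10) as response:
        return parse_thumbs(response.read())
```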

Scrapy's HtmlResponse does not retrieve data from the URL
