Question

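I'm new to Python, but I'm trying to learn it so that I can use Scrapy for work.

I'm currently following this tutorial: http://scrapy2.readthedocs.io/en/latest/intro/tutorial.html

I'm having trouble with this part (from the tutorial):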

def parse(self, response):
    for sel in response.xpath('//ul/li'):
        title = sel.xpath('a/text()').extract()
        link = sel.xpath('a/@href').extract()
        desc = sel.xpath('text()').extract()
        print title, link, desc

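The part I'm stuck on is for sel in response.xpath('//ul/li'):. As I understand it, this line narrows what gets scraped down to whatever matches the XPath //ul/li.

However, in my implementation I can't narrow the page down to a single section. I tried to work around this by selecting the entire HTML; see my attempt below: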

def parse(self, response):
    for sel in response.xpath('//html'):
        title = sel.xpath('//h1/text()').extract()
        author = sel.xpath('//head/meta[@name="author"]/@content').extract()
        mediumlink = sel.xpath('//head/link[@rel="author"]/@href').extract()
        print title, author, mediumlink

The XPaths work in the Chrome plugin I use, and response.xpath('//title').extract() also works in the scrapy shell.

I have already tried changing the line to:

for sel in response.xpath('//html'): and for sel in response.xpath('html'):

But every time I get:

2016-10-16 14:33:43 [scrapy] ERROR: Spider error processing <GET https://medium.com/swlh/how-our-app-went-from-20-000-day-to-2-day-in-revenue-d6892a2801bf#.smmwwqxlf> (referer: None)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 587, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/Matthew/Sites/crawl/tutorial/tutorial/spiders/medium_Spider.py", line 11, in parse
    for sel in response:
TypeError: 'HtmlResponse' object is not iterable

Can anyone give me some advice on how best to solve this? Go easy on me, my skills aren't that hot. Thanks!

Answer 1

As the error message shows, with

 for sel in response:

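you are trying to iterate over the response object itself on line 11 of your medium_Spider.py file.

But response is an HtmlResponse, which is not an iterable you can use in a for loop. You are missing a method call on response. Try looping the way you wrote it in your question: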

for sel in response.xpath('//html'):

.xpath('//html') returns an iterable (a SelectorList) that can be used in a for loop.
