Scrapy 0.16: response has no attribute selector or xpath()

Time: 2014-08-14 06:47:49

Tags: scrapy response web-crawler

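I have searched Google and read related questions on Stack Overflow, but nothing has worked.

response.body and response.headers work fine, but response.selector and response.xpath() raise an error saying that the response object has no such attribute.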

I also cannot import Selector, because there is no Selector anywhere in the scrapy package hierarchy (I don't know why).

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc

I am using Scrapy 0.16 (together with Django Dynamic Scraper, so I cannot upgrade, because it is only compatible with this version).

1 answer:

Answer 0 (score: 1)

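You are probably looking at the documentation for the latest version. A lot has changed since 0.16. You should read the documentation for 0.16 instead: http://doc.scrapy.org/en/0.16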
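To make sure the documentation matches the installed release, something like the following may help. It is only a sketch: `docs_url` is a hypothetical helper, not part of Scrapy, and it assumes that other releases follow the same URL pattern as the 0.16 link above.

```python
def docs_url(version):
    """Build a versioned docs URL from a version string such as '0.16.5'.

    Hypothetical helper: assumes the URL pattern of the 0.16 docs link
    also applies to other releases.
    """
    major, minor = version.split(".")[:2]
    return "http://doc.scrapy.org/en/%s.%s" % (major, minor)


if __name__ == "__main__":
    try:
        import scrapy
    except ImportError:
        print("Scrapy is not installed in this environment")
    else:
        # scrapy.__version__ holds the installed version string
        print(docs_url(scrapy.__version__))
```

If the printed URL does not match the docs you have been reading, that mismatch is the likely cause of the missing attributes.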

Your example should look like this:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc

This is described in the tutorial: http://doc.scrapy.org/en/0.16/intro/tutorial.html
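If the same spider code may later run on a newer Scrapy as well, a small compatibility helper could hide the difference between the two APIs. This is a minimal sketch, not something from the Scrapy docs: `xpath_query` and its `legacy_selector` parameter are hypothetical names, and it assumes that newer objects expose `.xpath()` while 0.16 selectors expose `.select()`.

```python
def xpath_query(target, query, legacy_selector=None):
    """Run an XPath query against a response or selector on old or new Scrapy.

    Hypothetical helper; `legacy_selector` defaults to HtmlXPathSelector
    and exists mainly so the function can be exercised without Scrapy.
    """
    # Newer responses and selectors expose .xpath() directly.
    if hasattr(target, "xpath"):
        return target.xpath(query)
    # 0.16-style selectors (e.g. items returned by hxs.select) use .select().
    if hasattr(target, "select"):
        return target.select(query)
    # Otherwise treat the target as a plain 0.16 response and wrap it.
    if legacy_selector is None:
        # Imported lazily so the module loads even when Scrapy is absent.
        from scrapy.selector import HtmlXPathSelector as legacy_selector
    return legacy_selector(target).select(query)
```

Inside `parse`, this would be used as `for site in xpath_query(response, '//ul/li'):` followed by `xpath_query(site, 'a/text()').extract()`.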