Question

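I have searched Google and read related questions on Stack Overflow, but nothing has worked. I tried

from scrapy.selector import selector

which raises an error. I also read the suggestion to use from scrapy.selector import HtmlXPathSelector, but that did not help either.

response.body and response.headers work fine, but response.selector and response.xpath() fail with an error saying the response object has no such attribute.

I also cannot import Selector, because there is no Selector in the scrapy package hierarchy (I don't know why).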

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc

I am using Scrapy 0.16 (with Django Dynamic Scraper, so I cannot upgrade, because it is only compatible with this version).

Answer 1

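You are probably looking at the documentation for the latest version. A lot has changed since 0.16. You should be reading the 0.16 documentation instead: http://doc.scrapy.org/en/0.16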

Your example should look like this:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc

as described in the tutorial: http://doc.scrapy.org/en/0.16/intro/tutorial.html
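If the same spider code has to run against both the 0.16 API and a newer one, the difference described above can be hidden behind a small helper. This is only a minimal sketch, not something Scrapy provides: select_nodes and its legacy_selector_cls parameter are hypothetical names introduced here for illustration. It assumes that newer responses expose xpath() and that 0.16 responses must be wrapped in HtmlXPathSelector and queried with select(), as in the answer's code.

```python
def select_nodes(response, query, legacy_selector_cls=None):
    """Run an XPath query on a new-style or a 0.16-style response.

    Hypothetical helper, not part of Scrapy.
    """
    # Newer Scrapy releases expose response.xpath() directly.
    if hasattr(response, "xpath"):
        return response.xpath(query)

    # Scrapy 0.16 has no such shortcut: wrap the response in a
    # selector object and call its .select() method instead.
    if legacy_selector_cls is None:
        # Imported lazily so the helper also loads where this
        # 0.16-era class is not available.
        from scrapy.selector import HtmlXPathSelector
        legacy_selector_cls = HtmlXPathSelector
    return legacy_selector_cls(response).select(query)
```

Inside parse() this would be called as select_nodes(response, '//ul/li'). Note that the nodes it returns still follow the API of the installed version, so on 0.16 the nested queries must keep using site.select(...) rather than site.xpath(...).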

Scrapy 0.16: response has no attribute selector or xpath()
