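I have searched Google and read the related questions on Stack Overflow, but nothing has worked. I have tried the suggested import
from scrapy.selector import HtmlXPathSelector
but it did not help. response.body and response.headers work fine, but response.selector and response.xpath() raise an error saying the response object has no such attribute.
I also cannot import Selector, because there is no Selector in the scrapy package hierarchy (I don't know why).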
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc
I am using Scrapy 0.16 (together with Django Dynamic Scraper, so I cannot upgrade, because it is only compatible with this version).
Answer 0 (score: 1)
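You are probably looking at the documentation for the latest version. A lot has changed since 0.16, so you should consult the 0.16 documentation instead: http://doc.scrapy.org/en/0.16
In 0.16 the response object has no xpath() method and there is no Selector class. You wrap the response in HtmlXPathSelector and call select() on it. Your example should look like this: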
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc
This is the selector usage described in the 0.16 documentation.
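If the same spider code might later run on a newer Scrapy, one option is a small wrapper that uses whichever XPath method the object actually offers. The following is only a sketch: `run_xpath` is a hypothetical helper name, not part of Scrapy, and it assumes that newer objects expose `xpath()` while 0.16 selectors expose `select()`, as in the two snippets above.

```python
def run_xpath(node, query):
    """Run an XPath query on a response or selector, whatever API it offers.

    Hypothetical helper (not part of Scrapy); `node` may be a response
    or a selector returned by an earlier query.
    """
    if hasattr(node, "xpath"):
        # Newer API: responses and selectors expose xpath() directly.
        return node.xpath(query)
    if hasattr(node, "select"):
        # Scrapy 0.16 selectors (e.g. HtmlXPathSelector) expose select().
        return node.select(query)
    # Assumption: anything else is a plain 0.16 response, which has to be
    # wrapped first. The import is deferred so it only runs on this path.
    from scrapy.selector import HtmlXPathSelector
    return HtmlXPathSelector(node).select(query)
```

Inside `parse`, the loop would then read `for site in run_xpath(response, '//ul/li'):` with nested queries written as `run_xpath(site, 'a/text()').extract()`. On 0.16 alone, calling `HtmlXPathSelector` directly as shown in the answer is simpler.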