Question

I can't configure Scrapy to crawl to a depth > 1. I have tried the following three options, none of which worked; request_depth_max in the summary log is always 1:

1) Adding:

from scrapy.conf import settings
settings.overrides['DEPTH_LIMIT'] = 2

to the spider file (the example from the website, just with a different site)

2) Running the command line with the -s option:

/usr/bin/scrapy crawl -s DEPTH_LIMIT=2 mininova.org

3) Adding to settings.py and scrapy.cfg:

DEPTH_LIMIT=2

How can I configure the crawl depth to be more than 1?

Answer 1

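warwaruk is right: the default value of the DEPTH_LIMIT setting is 0, i.e. "no limit will be imposed".

Let's scrape mininova and see what happens. Starting at the today page, we see that there are two links: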

stav@maia:~$ scrapy shell http://www.mininova.org/today
2012-08-15 12:27:57-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'[APSKAFT-018] Apskaft presents: Musique Concrte', fragment='', nofollow=False), Link(url='http://www.mininova.org/tor/13204737', text=u'e4g020-graphite412', fragment='', nofollow=False)]

Let's fetch the first link. There are no new links on that page, just a link to itself, which is not re-crawled by default (scrapy.http.Request(url[, ... dont_filter=False, ...])):

>>> fetch('http://www.mininova.org/tor/13204738')
2012-08-15 12:30:11-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204738> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'General information', fragment='', nofollow=False)]

No luck there; we are still at depth 1. Let's try the other link:

>>> fetch('http://www.mininova.org/tor/13204737')
2012-08-15 12:31:20-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204737> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204737', text=u'General information', fragment='', nofollow=False)]

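Nope, this page also contains only one link, a link to itself, which gets filtered as well. So there are actually no further links to scrape, and Scrapy closes the spider (at depth == 1).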
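The shell session above can be mimicked with a toy crawler. This is only a sketch in plain Python, not Scrapy's implementation: the `crawl` function and the `chain` link graph are hypothetical, and the `seen` set merely stands in for the duplicate filter. The `mininova` dictionary encodes the link structure observed in the session.

```python
from collections import deque


def crawl(links, start_url, depth_limit=0):
    """Toy breadth-first crawl; returns the maximum depth actually reached.

    links: hypothetical dict mapping a URL to the URLs found on that page.
    depth_limit: 0 means no limit, mirroring the DEPTH_LIMIT default.
    """
    seen = {start_url}               # stands in for the duplicate filter
    queue = deque([(start_url, 0)])  # start requests are at depth 0
    max_depth = 0
    while queue:
        url, depth = queue.popleft()
        max_depth = max(max_depth, depth)
        for link in links.get(url, []):
            if link in seen:
                continue             # already requested: filtered out
            if depth_limit and depth + 1 > depth_limit:
                continue             # deeper than the limit: dropped
            seen.add(link)
            queue.append((link, depth + 1))
    return max_depth


# Link structure seen in the shell session: each torrent page links only to itself.
mininova = {
    "/today": ["/tor/13204738", "/tor/13204737"],
    "/tor/13204738": ["/tor/13204738"],
    "/tor/13204737": ["/tor/13204737"],
}
# Hypothetical site whose pages link onward (not taken from mininova).
chain = {"/a": ["/b"], "/b": ["/c"], "/c": ["/d"]}

print(crawl(mininova, "/today"))          # 1
print(crawl(mininova, "/today", 2))       # 1
print(crawl(chain, "/a"))                 # 3
print(crawl(chain, "/a", depth_limit=2))  # 2
```

On the mininova-like graph the reported depth is 1 with or without a limit, because nothing new is left to request. A limit only changes the result when pages actually link onward, as in the hypothetical chain.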

Answer 2

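I had a similar issue; it helped to set follow=True when defining the Rule:

follow is a boolean which specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.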
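The default described in that quote can be written out as a small function. This is a sketch of the documented behaviour only; `effective_follow` is a hypothetical helper name, not part of Scrapy.

```python
def effective_follow(callback=None, follow=None):
    """Return whether links are followed, per the default quoted above."""
    if follow is None:
        # No explicit value given: follow only when there is no callback.
        return callback is None
    return bool(follow)


print(effective_follow())                                       # True
print(effective_follow(callback="parse_torrent"))               # False
print(effective_follow(callback="parse_torrent", follow=True))  # True
```

The second call corresponds to a rule that has a callback but no explicit follow value, so links on the matched pages are not followed unless follow=True is passed.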

Answer 3

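The default value of the DEPTH_LIMIT setting is 0, i.e. "no limit will be imposed".

You wrote:

"request_depth_max in the summary log is always 1"

What you see in the log is a statistic, not a setting. When it says that request_depth_max is 1, it means that no further requests were yielded from the first callback.

You have to show your spider code for anyone to understand what is going on.

But please create a separate question for that.

UPDATE:

Ah, I see you are running the mininova spider from the Scrapy intro: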

class MininovaSpider(CrawlSpider):

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)

        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent

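As you can see from the code, the spider never issues any requests for other pages; it scrapes all the data directly from the top-level pages. That is why the maximum depth is 1.

If you write your own spider that follows links to other pages, the maximum depth will be greater than 1.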

Scrapy can't crawl deeper than depth 1
