Scrapy not crawling in DFS order

Date: 2015-09-11 22:26:52

Tags: python multithreading scrapy depth-first-search

Scrapy appears to be crawling pages in BFS order, even though the documentation says the default order should be DFS.

Here is my spider:

import scrapy
from scrapy.http import Request


class DfsSpider(scrapy.Spider):
    name = 'dfs'
    allowed_domains = ['craigslist.org']
    start_urls = ['http://seattle.craigslist.org']

    def parse(self, response):
        print("URL FROM PARSE: ", response.url)
        xpath = "//div[contains(@class, 'community')]/div/div/ul/li/a/@href"
        for link in response.xpath(xpath):
            url = response.urljoin(link.extract())
            yield Request(url, callback=self.parse_data)

    def parse_data(self, response):
        print("URL FROM PARSE_DATA: ", response.url)
        xpath = "//div[contains(@class, 'content')]/p/span/span/a/@href"
        for link in response.xpath(xpath):
            url = response.urljoin(link.extract())
            yield Request(url, callback=self.parse_data_again)

    def parse_data_again(self, response):
        print("URL FROM PARSE_DATA_AGAIN: ", response.url)

The output is a single "URL FROM PARSE: www.seattle.craigslist.org", followed by a bunch of "URL FROM PARSE_DATA: www.seattle.craigslist.org/search/..."

Only after that do I start to see the print statements from the parse_data_again() method.

If Scrapy were searching in DFS order, I should see:

"URL FROM PARSE: ..."

"URL FROM PARSE_DATA: ..."

"URL FROM PARSE_DATA_AGAIN: ..."

"URL FROM PARSE_DATA_AGAIN: ..."

...

"URL FROM PARSE_DATA_AGAIN: ..."

"URL FROM PARSE_DATA: ..."

"URL FROM PARSE_DATA_AGAIN: ..."

...

And so on. Now, I suspect Scrapy uses some kind of threading, which may be why requests are issued and responses received in a jumbled order. But multiple threads exploring different parts of the search tree is not DFS...
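To make the distinction concrete, here is a minimal sketch (not Scrapy itself, just an assumed model of its scheduler): a LIFO queue processed one request at a time yields the strict depth-first visiting order described above, which is exactly what concurrent in-flight requests break.

```python
# Hypothetical model of a LIFO (last-in, first-out) scheduler, which is
# what Scrapy uses by default. Processing one "request" at a time gives
# strict depth-first order; with several requests in flight at once, the
# responses interleave and the printed order no longer looks like DFS.
def crawl_lifo(link_graph, root):
    stack = [root]   # pending requests, LIFO
    order = []       # pages in the order their callbacks run
    while stack:
        page = stack.pop()   # exactly one request "in flight"
        order.append(page)
        # push children in reverse so the first link is popped first
        for child in reversed(link_graph.get(page, [])):
            stack.append(child)
    return order

# toy site: the root links to two sections, each with two listings
site = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1", "b2"],
}
print(crawl_lifo(site, "root"))  # → ['root', 'a', 'a1', 'a2', 'b', 'b1', 'b2']
```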

If that is the case, can I configure Scrapy to handle only one request at a time?
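For what it's worth, Scrapy exposes a `CONCURRENT_REQUESTS` setting that caps how many requests are in flight at once; a sketch of what the project's `settings.py` might contain (the setting name is real, the value reflects the single-request behavior being asked about):

```python
# settings.py -- sketch, assuming the goal is strict one-at-a-time crawling.
# With a single in-flight request, Scrapy's default LIFO scheduler should
# visit pages in depth-first order, at the cost of crawl speed.
CONCURRENT_REQUESTS = 1
```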

Or maybe I am confused about something else entirely. Any help is appreciated.

0 Answers:

There are no answers.