Question

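I am using Python 2.7 and Scrapy 1.0.4. Each of the crawl steps below was tested individually in the Scrapy shell and worked. However, when I put them together, Scrapy does not seem to go any deeper than the first level.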

import scrapy

class trbSpider(scrapy.Spider):
    name = "trb"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.sciencedirect.com/science/journal/01912615",
    ]

    def parse(self, response):
        print '------ crawling root dir ------'
        for href in response.css('a.volLink::attr("href")'):
            url = response.urljoin(href.extract())
            print url
            yield scrapy.Request(url, self.parse_volume)

    def parse_volume(self, response):
        print '------ crawling sub dir ------'
        for href in response.css('div.currentVolumes a::attr("href")'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        print '------ crawing authors name'
        for authors in response.css('li.authors::text'):
            yield {'authors': authors.extract()}

Answer 1

You are probably seeing this behavior because of this line:

allowed_domains = ["dmoz.org"]

You should either remove it or use:

allowed_domains = ["sciencedirect.com"]

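Note that allowed_domains does not affect start_urls, but it does filter out any other URLs yielded by the spider's callbacks. That is why the start page is fetched and parse runs, while the volume requests it yields are silently dropped.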
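
To make the filtering rule concrete, here is a minimal sketch in plain Python of the kind of host check involved. It is not Scrapy's actual code: is_allowed is a hypothetical helper, and it assumes the rule is "the host equals an allowed domain or is a subdomain of one".

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host equals an allowed
    domain or is a subdomain of one (an approximation of the offsite check)."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


url = "http://www.sciencedirect.com/science/journal/01912615"
print(is_allowed(url, ["dmoz.org"]))           # False
print(is_allowed(url, ["sciencedirect.com"]))  # True
```

With ["dmoz.org"], every link extracted from the journal page fails this kind of check, so parse_volume is never called.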
