Strange error when using Scrapy

Date: 2014-12-26 11:50:47

Tags: python scrapy web-crawler

I'm following this tutorial to learn Scrapy, but I've run into a very strange problem. Instead of following the links, it just extracts the start_urls URL and puts it in data.json. This is the code I'm using:

import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor

class ArticleItem(scrapy.Item):
    url = scrapy.Field()

class ScholarSpider(scrapy.Spider):
    name = "scholar"
    allowed_domains = ["mininova.org/"]
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(LinkExtractor(allow=['/tor/13278067'], deny=['http://www.mininova.org/today']), 'parse')]

    def parse(self, response):
        article = ArticleItem()
        article['url'] = response.url
        return article

For development purposes, I tried to extract only the URLs, and in fact only one very specific URL ending in /tor/13278067. I ran the spider as follows:

$ scrapy crawl scholar -o data.json

What I find in data.json is:

[{"url": "http://www.mininova.org/today"}]

1 Answer:

Answer 0 (score: 1)

To use rules, you need to inherit from CrawlSpider instead of Spider. However, when inheriting from CrawlSpider you must not override the parse() method, because CrawlSpider implements parse() itself to drive the rule processing; give your callback a different name instead. So basically, this is what you need:

import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor

class ArticleItem(scrapy.Item):
    url = scrapy.Field()

# inherit from CrawlSpider instead of Spider
class ScholarSpider(CrawlSpider):
    name = "scholar"
    allowed_domains = ["mininova.org"]
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(LinkExtractor(allow=[r'/tor/\d+']), callback='parse_url')]

    # do not use parse() because CrawlSpider
    # needs it for its normal operation
    def parse_url(self, response):
        article = ArticleItem()
        article['url'] = response.url
        return article
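
Running the spider with the same command as before should then write the followed /tor/ links instead of the start URL. The output below is only a sketch; the actual IDs depend on whatever /today lists at crawl time:

$ scrapy crawl scholar -o data.json
$ cat data.json
[{"url": "http://www.mininova.org/tor/13278067"}, ...]

Note that scrapy.contrib was deprecated in Scrapy 1.0, so on newer versions the equivalent imports are from scrapy.spiders import CrawlSpider, Rule and from scrapy.linkextractors import LinkExtractor.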