How do I reuse the parse method of my scrapy Spider-based spider in an inheriting CrawlSpider?

Asked: 2015-01-22 02:45:22

Tags: python web-scraping scrapy scrapy-spider

I currently have a Spider-based spider that I wrote for crawling an input JSON array of start_urls:

from scrapy.spider import Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader

import json
import datetime
import re

class AtlanticFirearmsSpider(Spider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls='[]', *args, **kwargs):
        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)
        # start_urls arrives as a JSON-encoded array via -a on the command line
        self.start_urls = json.loads(start_urls)

    def parse(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        product = l.load_item()
        return product

I can call it from the command line like this, and it works great:

scrapy crawl atlantic_firearms -a start_urls='["http://www.atlanticfirearms.com/component/virtuemart/shipping-rifles/ak-47-receiver-aam-47-detail.html", "http://www.atlanticfirearms.com/component/virtuemart/shipping-accessories/nitride-ak47-7-62x39mm-barrel-detail.html"]'

However, I'm now trying to add a CrawlSpider-based spider that inherits from it, crawls the entire site, and reuses the parse method logic. My first attempt looked like this:

class AtlanticFirearmsCrawlSpider(CrawlSpider, AtlanticFirearmsSpider):
    name = "atlantic_firearms_crawler"
    start_urls = [
        "http://www.atlanticfirearms.com"
    ]
    rules = (
        # I know, I need to update these to LxmlLinkExtractor
        Rule(SgmlLinkExtractor(allow=['detail.html']), callback='parse'),
        Rule(SgmlLinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion'])),
    )

Running this spider with

scrapy crawl atlantic_firearms_crawler

crawls the site but never parses any items. I think that's because CrawlSpider apparently has its own definition of parse, so somehow I'm messing that up.
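
For reference, CrawlSpider's built-in parse is roughly the following (paraphrased from scrapy/contrib/spiders/crawl.py of this era, so treat it as a sketch rather than an exact quote):

def parse(self, response):
    # CrawlSpider uses parse as its own entry point for applying the Rules;
    # overriding it in a subclass bypasses the rule machinery entirely.
    return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)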

When I change callback='parse' to callback='parse_item' and rename the parse method in AtlanticFirearmsSpider to parse_item, it works wonderfully, crawling the whole site and parsing items successfully. But if I then try to run my original atlantic_firearms spider again, it errors out with NotImplementedError, apparently because Spider-based spiders really do expect the parse method to be defined as parse.
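
That matches the base class: in Spider, parse is just an abstract stub, roughly (again a from-memory sketch of scrapy/spider.py):

def parse(self, response):
    # the default Spider callback; subclasses are expected to override it
    raise NotImplementedError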

What's the best way to reuse the logic between these spiders, so that I can both supply a JSON array of start_urls and do full-site crawls?

1 Answer:

Answer 0 (score: 4)

You can avoid multiple inheritance here.

Combine the two spiders into one. If start_urls is passed from the command line, it behaves like a regular spider crawling only those URLs; otherwise, it acts as a CrawlSpider and crawls the whole site:

from scrapy.contrib.spiders import CrawlSpider, Rule

from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.linkextractors import LinkExtractor

import json


class AtlanticFirearmsSpider(CrawlSpider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls=None, *args, **kwargs):
        if start_urls:
            # URLs passed via -a: act like a plain spider on just those pages
            self.start_urls = json.loads(start_urls)
            self.rules = []
            self.parse = self.parse_response
        else:
            # nothing passed: crawl the whole site with the CrawlSpider rules
            self.start_urls = ["http://www.atlanticfirearms.com/"]
            self.rules = [
                Rule(LinkExtractor(allow=['detail.html']), callback='parse_response'),
                Rule(LinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion']))
            ]

        # set rules before calling CrawlSpider.__init__, which compiles them
        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)

    def parse_response(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        product = l.load_item()
        return product
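
With this combined spider, both modes can be run under the same name (usage mirroring the commands above):

# full-site crawl: no start_urls passed, so the CrawlSpider rules apply
scrapy crawl atlantic_firearms

# targeted crawl: a JSON array of URLs is passed and the rules are disabled
scrapy crawl atlantic_firearms -a start_urls='["http://www.atlanticfirearms.com/component/virtuemart/shipping-rifles/ak-47-receiver-aam-47-detail.html"]'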

Or, alternatively, just extract the logic from the parse() method into a library function and call it from both spiders, which then don't need to be related at all; a minimal sketch follows.
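
A minimal sketch of that approach, assuming a hypothetical shared module foo/parsers.py (the module name and the load_product helper are illustrative, not from the original answer):

# foo/parsers.py
from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader

def load_product(response):
    # shared item-building logic lifted out of the spiders' callbacks
    l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
    return l.load_item()

Each spider's callback then just delegates to it:

def parse(self, response):          # in the Spider-based spider
    return load_product(response)

def parse_item(self, response):     # in the CrawlSpider-based spider
    return load_product(response)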