I currently have a Spider-based spider that I wrote for crawling an input JSON array of start_urls:
from scrapy.spider import Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader

import json
import datetime
import re


class AtlanticFirearmsSpider(Spider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls='[]', *args, **kwargs):
        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)
        self.start_urls = json.loads(start_urls)

    def parse(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        product = l.load_item()
        return product
I can call it from the command line like so, and it works great:
scrapy crawl atlantic_firearms -a start_urls='["http://www.atlanticfirearms.com/component/virtuemart/shipping-rifles/ak-47-receiver-aam-47-detail.html", "http://www.atlanticfirearms.com/component/virtuemart/shipping-accessories/nitride-ak47-7-62x39mm-barrel-detail.html"]'
However, I'm trying to add a CrawlSpider-based spider that inherits from it, crawls the entire site, and reuses the parse method logic. My first attempt looked like this:
class AtlanticFirearmsCrawlSpider(CrawlSpider, AtlanticFirearmsSpider):
    name = "atlantic_firearms_crawler"
    start_urls = [
        "http://www.atlanticfirearms.com"
    ]
    rules = (
        # I know, I need to update these to LxmlLinkExtractor
        Rule(SgmlLinkExtractor(allow=['detail.html']), callback='parse'),
        Rule(SgmlLinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion'])),
    )
Running this spider with scrapy crawl atlantic_firearms_crawler crawls the site but never parses any items. I think that's because CrawlSpider apparently has its own definition of parse, so somehow I'm breaking things.
When I change callback='parse' to callback='parse_item' and rename the parse method in AtlanticFirearmsSpider to parse_item, it works wonderfully: it crawls the whole site and parses items successfully. But if I then try to run my original atlantic_firearms spider again, it errors out with NotImplementedError, apparently because Spider-based spiders really do expect the parse method to be named parse.
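As far as I can tell, that matches what the two base classes do in these older Scrapy releases (a rough paraphrase from memory, not verbatim source):

# Paraphrase of the relevant base-class behavior in older Scrapy
# versions; details are approximate.

class Spider(object):
    def parse(self, response):
        # Plain spiders must override parse(), hence the
        # NotImplementedError once the method is renamed away.
        raise NotImplementedError

class CrawlSpider(Spider):
    def parse(self, response):
        # CrawlSpider reserves parse() for its own link-following
        # machinery, so overriding it breaks the rules.
        return self._parse_response(response, self.parse_start_url,
                                    cb_kwargs={}, follow=True)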
What's the best way for me to reuse logic between these spiders, so that I can both supply a JSON array of start_urls and do full-site crawls?
Answer 0 (score: 4)
You can avoid multiple inheritance here.
Combine both spiders into a single one. If start_urls is passed from the command line, it behaves like a regular spider; otherwise, it behaves like a CrawlSpider:
from scrapy import Item
from scrapy.contrib.spiders import CrawlSpider, Rule

from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.linkextractors import LinkExtractor

import json


class AtlanticFirearmsSpider(CrawlSpider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls=None, *args, **kwargs):
        if start_urls:
            self.start_urls = json.loads(start_urls)
            self.rules = []
            self.parse = self.parse_response
        else:
            self.start_urls = ["http://www.atlanticfirearms.com/"]
            self.rules = [
                Rule(LinkExtractor(allow=['detail.html']), callback='parse_response'),
                Rule(LinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion']))
            ]
        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)

    def parse_response(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        product = l.load_item()
        return product
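With this merged spider, both invocations from the question should work unchanged (assuming the same project layout). For the full-site crawl:

scrapy crawl atlantic_firearms

And for a targeted crawl of specific URLs:

scrapy crawl atlantic_firearms -a start_urls='["http://www.atlanticfirearms.com/component/virtuemart/shipping-rifles/ak-47-receiver-aam-47-detail.html"]'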
Or, alternatively, just extract the logic inside the parse() method into a library function and call it from both spiders, keeping them as two separate, unrelated spiders.
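A minimal sketch of that alternative, assuming a hypothetical shared module foo/parsers.py (the module and function names here are illustrative, not from the original):

# foo/parsers.py -- hypothetical shared helper
from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader


def load_product(response):
    # Build an AtlanticFirearmsItem from a product detail page.
    l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
    return l.load_item()

Each spider then keeps its own callback as a one-liner:

from foo.parsers import load_product

class AtlanticFirearmsSpider(Spider):
    # ... name, allowed_domains, __init__ as in the question ...
    def parse(self, response):
        return load_product(response)

class AtlanticFirearmsCrawlSpider(CrawlSpider):
    # ... rules as before, with callback='parse_item' ...
    def parse_item(self, response):
        return load_product(response)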