How can I combine two spiders into one?

Time: 2019-03-17 09:03:54

Tags: python scrapy

I have two spiders that use the same resource file and have almost identical structures.

spiderA contains:

import scrapy
import pkgutil

class StockSpider(scrapy.Spider):
    name = "spiderA"
    data = pkgutil.get_data("tutorial", "resources/webs.txt")
    data = data.decode()
    urls = data.split("\r\n")
    start_urls = [url + "string1" for url in urls]

    def parse(self, response):
        pass

spiderB contains:

import scrapy
import pkgutil

class StockSpider(scrapy.Spider):
    name = "spiderB"
    data = pkgutil.get_data("tutorial", "resources/webs.txt")
    data = data.decode()
    urls = data.split("\r\n")
    start_urls = [url + "string2" for url in urls]

    def parse(self, response):
        pass

How can I combine spiderA and spiderB, and add a switch variable so that scrapy crawl invokes whichever spider I need?

2 Answers:

Answer 0 (score: 2)

Try adding a separate parameter for the spider type. You can set it by running scrapy crawl myspider -a spider_type=second. Check this code example:

import scrapy
import pkgutil

class StockSpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        if not hasattr(self, 'spider_type'):
            self.logger.error('No spider_type specified')
            return
        data = pkgutil.get_data("tutorial", "resources/webs.txt")
        data = data.decode()

        for url in data.split("\r\n"):
            if self.spider_type == 'first':
                url += 'first'
            if self.spider_type == 'second':
                url += 'second'
            yield scrapy.Request(url)

    def parse(self, response):
        pass

Alternatively, you can always create a base class and then inherit from it, overriding only one variable (the string appended to the URL) and the name (so each spider can be invoked separately).

Answer 1 (score: 0)

Using bare spider_type raises an error:

NameError: name 'spider_type' is not defined.

It must be referenced as self.spider_type inside the spider class.

import scrapy
import pkgutil

class StockSpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        if not hasattr(self, 'spider_type'):
            self.logger.error('No spider_type specified')
            return
        data = pkgutil.get_data("tutorial", "resources/webs.txt")
        data = data.decode()

        for url in data.split("\r\n"):
            if self.spider_type == 'first':
                url += 'first'
            if self.spider_type == 'second':
                url += 'second'
            yield scrapy.Request(url)

    def parse(self, response):
        pass

To make the invocation stricter and more precise:

scrapy crawl myspider -a spider_type='second'