Scrapy: putting two spiders in a single file

时间:2016-03-23 08:50:06

标签: python scrapy scrapy-spider

I wrote two spiders in a single file. When I run scrapy runspider two_spiders.py, only the first spider is executed. How can I run both of them without splitting the file into two files?

two_spiders.py:

import scrapy

class MySpider1(scrapy.Spider):
    # first spider definition
    ...

class MySpider2(scrapy.Spider):
    # second spider definition
    ...

I followed @llya's answer and got the following error:
2016-03-23 19:48:52 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-23 19:48:52 [scrapy] INFO: Optional features available: ssl, http11
2016-03-23 19:48:52 [scrapy] INFO: Overridden settings: {}
2016-03-23 19:48:54 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-23 19:48:54 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-23 19:48:54 [scrapy] INFO: Optional features available: ssl, http11
2016-03-23 19:48:54 [scrapy] INFO: Optional features available: ssl, http11
...
scrapy runspider two_spiders.py

Traceback (most recent call last):
  File "/opt/pyenv.agutong-scrapy/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/opt/pyenv.agutong-scrapy/lib/python2.7/site-packages/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/opt/pyenv.agutong-scrapy/lib/python2.7/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/opt/pyenv.agutong-scrapy/lib/python2.7/site-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/opt/pyenv.agutong-scrapy/lib/python2.7/site-packages/scrapy/commands/runspider.py", line 89, in run
    self.crawler_process.start()
  File "/opt/pyenv.agutong-scrapy/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/opt/pyenv.agutong-scrapy/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/opt/pyenv.agutong-scrapy/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/opt/pyenv.agutong-scrapy/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

2 Answers:

Answer 0 (score: 1)

Let's read the documentation:

Running multiple spiders in the same process

By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.

Here is an example that runs multiple spiders simultaneously:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

(There are more examples in the documentation.)

From your question it is not clear how you put the two spiders into one file. Simply concatenating the contents of two single-spider files is not enough.

Try doing what is written in the documentation, or at least show us your code. We can't help you without it.
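
A side note on the ReactorNotRestartable traceback above: scrapy runspider starts the Twisted reactor itself, so if the file also builds a CrawlerProcess and calls process.start() at import time, the reactor gets started a second time and fails. Below is a minimal sketch of how two_spiders.py could be completed, assuming it is meant to be run with plain python rather than scrapy runspider; the spider names, URLs and parse bodies are placeholders, not the asker's actual code:

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider1(scrapy.Spider):
    # Placeholder first spider: fetches one page and yields a marker item.
    name = 'spider1'
    start_urls = ['http://quotes.toscrape.com/page/1/']  # assumed example URL

    def parse(self, response):
        yield {'spider': self.name, 'url': response.url}


class MySpider2(scrapy.Spider):
    # Placeholder second spider.
    name = 'spider2'
    start_urls = ['http://quotes.toscrape.com/page/2/']  # assumed example URL

    def parse(self, response):
        yield {'spider': self.name, 'url': response.url}


if __name__ == '__main__':
    # The __main__ guard keeps the reactor from being started at import time,
    # which is what scrapy runspider would otherwise trip over.
    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()  # blocks until both crawls are finished

With that guard in place the file is run as python two_spiders.py; running it through scrapy runspider would still execute only one spider, which is the behaviour described in the question.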

Answer 1 (score: 0)

I ended up here while searching for a Scrapy project in a single file. Here is a complete Scrapy project with two spiders in one file.

# quote_spider.py
import json
import string

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.item import Item, Field


class TextCleaningPipeline(object):
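    # Strips curly quotes and punctuation from each quote's text and lowercases it.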

    def _clean_text(self, text):
        text = text.replace('“', '').replace('”', '')
        table = str.maketrans({key: None for key in string.punctuation})
        new_text = text.translate(table)
        return new_text.lower()

    def process_item(self, item, spider):
        item['text'] = self._clean_text(item['text'])
        return item


class JsonWriterPipeline(object):
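    # Appends each item as one JSON line to the file named by the JSON_FILE setting.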

    def open_spider(self, spider):
        self.file = open(spider.settings['JSON_FILE'], 'a')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


class QuoteItem(Item):
    text = Field()
    author = Field()
    tags = Field()
    spider = Field()


class QuotesSpiderOne(scrapy.Spider):
    name = "quotes1"

    def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/1/', ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            item['spider'] = self.name
            yield item


class QuotesSpiderTwo(scrapy.Spider):
    name = "quotes2"

    def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/2/', ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            item['spider'] = self.name
            yield item


if __name__ == '__main__':
    settings = dict()
    settings['USER_AGENT'] = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    settings['HTTPCACHE_ENABLED'] = True
    settings['JSON_FILE'] = 'items.jl'
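    # Pipelines are referenced through __main__ because the whole project lives in this one file.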
    settings['ITEM_PIPELINES'] = dict()
    settings['ITEM_PIPELINES']['__main__.TextCleaningPipeline'] = 800
    settings['ITEM_PIPELINES']['__main__.JsonWriterPipeline'] = 801

    process = CrawlerProcess(settings=settings)
    process.crawl(QuotesSpiderOne)
    process.crawl(QuotesSpiderTwo)
    process.start()

After installing Scrapy, run:

$ python quote_spider.py 

No other files are needed.

Using this example together with the graphical debugger in PyCharm / VS Code can help you understand the Scrapy workflow and simplify debugging.