Fetch messages from RabbitMQ, then save the parsed data to MongoDB

Time: 2018-11-30 16:07:44

Tags: python mongodb scrapy rabbitmq

I want Scrapy to fetch messages from RabbitMQ and then save the parsed data to MongoDB.

I'm using scrapy-rabbitmq-link to receive messages from RabbitMQ, and I have the following in settings.py:

# Enables scheduling by storing the requests queue in RabbitMQ.
SCHEDULER = "scrapy_rabbitmq_link.scheduler.SaaS"

# Provide AMQP connection string
RABBITMQ_CONNECTION_PARAMETERS = 'amqp://admin:password@localhost:5672/'

# Set response status codes to requeue messages on
SCHEDULER_REQUEUE_ON_STATUS = [500]

# Middleware acks RabbitMQ message on success
DOWNLOADER_MIDDLEWARES = {
    'scrapy_rabbitmq_link.middleware.RabbitMQMiddleware': 10
}

The current spider:

import json
import scrapy


class TestSpider(scrapy.Spider):
    name = "TestSpider"
    amqp_key = 'news'  # RabbitMQ queue this spider reads from

    def _make_request(self, mframe, hframe, body):
        # Called by scrapy-rabbitmq-link for every message taken off the queue
        print('######## url ##########')
        item = json.loads(body)
        articleLink = item['url']
        print(articleLink)
        return scrapy.Request(articleLink, callback=self.parse)

    def parse(self, response):
        print('######## Parse article ##########')
        articleBody = response.css('div.StandardArticleBody_body ::text').extract_first()
        print('######## Article body ##########')
        print(articleBody)

        yield {
            'body': articleBody
        }
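
For context, the messages the spider consumes are JSON objects with a 'url' field (that is what _make_request() parses). A producer sketch using pika, assuming the broker credentials from RABBITMQ_CONNECTION_PARAMETERS and that the spider's amqp_key names the queue (the queue_declare arguments may need to match whatever the scheduler declares):

import json
import pika

connection = pika.BlockingConnection(
    pika.URLParameters('amqp://admin:password@localhost:5672/'))
channel = connection.channel()
# Assumption: the scheduler consumes from a queue named after amqp_key
channel.queue_declare(queue='news', durable=True)
# Each message body is a JSON object with a 'url' field,
# matching what _make_request() expects
channel.basic_publish(
    exchange='',
    routing_key='news',
    body=json.dumps({'url': 'https://example.com/article'}))
connection.close()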

At the moment I can receive messages from RabbitMQ, but I don't know how to save the parsed items to MongoDB.
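
For reference, the standard Scrapy approach would be an item pipeline. A minimal sketch using pymongo; the URI, database, and collection names, and the module path in ITEM_PIPELINES, are placeholders rather than names from my project:

# pipelines.py
import pymongo

class MongoPipeline(object):

    def open_spider(self, spider):
        # One client per spider run; the URI is a placeholder
        self.client = pymongo.MongoClient('mongodb://localhost:27017/')
        self.db = self.client['news']           # placeholder database name
        self.collection = self.db['articles']   # placeholder collection name

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # dict() works for both scrapy.Item instances and plain dicts
        self.collection.insert_one(dict(item))
        return item

enabled in settings.py with:

ITEM_PIPELINES = {
    'tutorial.pipelines.MongoPipeline': 300,  # placeholder module path
}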

If I add

item = TutorialItem()
item['title'] = response.css('h3.storytitle::text').extract_first().strip()
item['body'] = articleBody.strip()
yield item

to the Spider's parse function and save the item to MongoDB through a pipeline along the lines of the sketch above, I find that it always adds new messages to RabbitMQ instead of consuming them. I suspect this is related to SCHEDULER = "scrapy_rabbitmq_link.scheduler.SaaS".
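
(TutorialItem above is just the project's usual Item subclass; a sketch showing only the two fields used:)

# items.py
import scrapy

class TutorialItem(scrapy.Item):
    title = scrapy.Field()
    body = scrapy.Field()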

Any suggestions?

0 answers:

No answers yet.