I want Scrapy to pick up messages from RabbitMQ and then save the parsed data to MongoDB.
I'm using scrapy-rabbitmq-link to receive the messages from RabbitMQ, and I have the following settings in settings.py:

# Enables scheduling by storing the requests queue in RabbitMQ.
SCHEDULER = "scrapy_rabbitmq_link.scheduler.SaaS"
# Provide AMQP connection string
RABBITMQ_CONNECTION_PARAMETERS = 'amqp://admin:password@localhost:5672/'
# Set response status codes to requeue messages on
SCHEDULER_REQUEUE_ON_STATUS = [500]
# Middleware acks RabbitMQ message on success
DOWNLOADER_MIDDLEWARES = {
    'scrapy_rabbitmq_link.middleware.RabbitMQMiddleware': 10
}
Current spider:
import json

import scrapy


class TestSpider(scrapy.Spider):
    name = "TestSpider"
    amqp_key = 'news'  # RabbitMQ queue this spider consumes from

    def _make_request(self, mframe, hframe, body):
        # Called for each RabbitMQ message; the body is expected to be
        # a JSON object that carries the article URL.
        print('######## url ##########')
        item = json.loads(body)
        articleLink = item['url']
        print(articleLink)
        return scrapy.Request(articleLink, callback=self.parse)

    def parse(self, response):
        print('######## Parse article ##########')
        # NOTE: extract_first() returns only the first matched text node.
        articleBody = response.css('div.StandardArticleBody_body ::text').extract_first()
        print('######## Article body ##########')
        print(articleBody)
        yield {
            'body': articleBody
        }
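
For context, the messages on the 'news' queue are plain JSON objects with a url field. The producer side is a separate script; roughly the following minimal pika sketch (the connection string matches the settings above; the example URL is a placeholder):

import json

import pika

# Publish one test message to the queue the spider listens on.
connection = pika.BlockingConnection(
    pika.URLParameters('amqp://admin:password@localhost:5672/'))
channel = connection.channel()
# NB: declare arguments must match the settings of the already-existing queue.
channel.queue_declare(queue='news', durable=True)
channel.basic_publish(
    exchange='',
    routing_key='news',  # must match the spider's amqp_key
    body=json.dumps({'url': 'https://example.com/some-article'}))  # placeholder URL
connection.close()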
At the moment I can receive messages from RabbitMQ, but I don't know how to save the scraped items to MongoDB.
However, if I add
item = TutorialItem()
item['title'] = response.css('h3.storytitle::text').extract_first().strip()
item['body'] = articleBody.strip()
yield item
to the spider's parse function and save the item to MongoDB in a pipeline (sketched below), it always adds new messages to RabbitMQ instead of consuming them. I suspect this is related to SCHEDULER = "scrapy_rabbitmq_link.scheduler.SaaS".
Any suggestions?