Question

I am trying to use Scrapy as a consumer of RabbitMQ messages.

Here is my code snippet:

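import json

import pika
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

# MySpider and the module-level `settings` used below come from the project itself.


def runTester(body):
    spider = MySpider(domain=body["url"], body=body)
    settings = get_project_settings()
    crawler = Crawler(settings)
    # Stop the reactor as soon as the spider closes.
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # blocks until reactor.stop() is called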


def callback(ch, method, properties, body):
    body = json.loads(body)
    runTester(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

if __name__ == '__main__':
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=settings.RABBITMQ_HOST))
    channel = connection.channel()
    channel.queue_declare(queue=settings.RABBITMQ_TESTER_QUEUE, durable=True)
    channel.basic_qos(prefetch_count=1)
    channel.basic_consume(callback, queue=settings.RABBITMQ_TESTER_QUEUE)
    channel.start_consuming()

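As you can see, the problem is that the reactor stops once the first message has been consumed and the spider has run. What is the workaround for this?

I would like to keep the reactor running and start a new crawler whenever a message arrives from RabbitMQ.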

Answer 1

A better approach is to start the spider through the Scrapy daemon (Scrapyd) API. When a request for a spider comes in, you would call curl like so:

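import json
import logging
import subprocess

logger = logging.getLogger(__name__)


# Hypothetical wrapper name; the original snippet was the body of an unnamed function.
def schedule_spider(flat_args):
    # flat_args: extra curl arguments, e.g. ['-d', 'spider=somespider']
    reply = {}
    args = ['curl',
            'http://localhost:6800/schedule.json',
            '-d', 'project=myproject', ] + flat_args
    json_reply = subprocess.Popen(args, stdout=subprocess.PIPE).communicate()[0]
    try:
        reply = json.loads(json_reply)
        if reply['status'] != 'ok':
            logger.error('Error in spider: %r: %r.', args, reply)
        else:
            logger.debug('Started spider: %r: %r.', args, reply)
    except Exception:
        logger.error('Error starting spider: %r: %r.', args, json_reply)
    return reply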

This starts a subprocess that effectively runs:

$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider
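
If you would rather not shell out to curl, the same POST can be sent from Python's standard library. This is only a rough sketch: the function names are made up for illustration, and it assumes the daemon listens on localhost:6800 as in the curl call. Any extra keyword arguments are sent as additional form fields, which the daemon hands to the spider as arguments.

```python
import json
import urllib.parse
import urllib.request

# Same endpoint as the curl call; adjust if your daemon runs elsewhere.
SCRAPYD_URL = "http://localhost:6800/schedule.json"


def build_schedule_request(project, spider, **spider_args):
    """Build the POST request for schedule.json without sending it."""
    fields = {"project": project, "spider": spider}
    fields.update(spider_args)  # extra fields are passed on as spider arguments
    data = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(SCRAPYD_URL, data=data, method="POST")


def schedule_spider_http(project, spider, **spider_args):
    """Send the request and return the decoded JSON reply as a dict."""
    request = build_schedule_request(project, spider, **spider_args)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling schedule_spider_http("myproject", "somespider", url="http://example.com") should come back with a dict whose 'status' is 'ok' when the job was accepted, the same reply the curl version parses.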

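The Scrapy daemon manages spider launches and has many other useful features, such as deploying a new spider version with a simple scrapy deploy command, and monitoring and balancing multiple spiders.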
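
To tie this back to the question: with the daemon in place, the RabbitMQ callback no longer needs to touch the Twisted reactor at all. It only hands each message to the daemon and acknowledges it. The sketch below is one way that might look. `make_callback`, `main`, and the `schedule` callable (anything that takes the decoded message and returns the daemon's JSON reply as a dict) are hypothetical names, the default host and queue are placeholders, and requeueing on failure is an assumption you may want to replace with a dead-letter queue to avoid retry loops.

```python
import json


def make_callback(schedule):
    """Return a pika-style callback that hands each message to `schedule`.

    `schedule` is a hypothetical callable: it receives the decoded message
    dict and returns the daemon's JSON reply as a dict.
    """
    def callback(ch, method, properties, body):
        message = json.loads(body)
        reply = schedule(message)
        if reply.get("status") == "ok":
            # The job was accepted, so the message can be acknowledged.
            ch.basic_ack(delivery_tag=method.delivery_tag)
        else:
            # Assumption: put the message back on the queue for a later retry.
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
    return callback


def main(schedule, host="localhost", queue="tester"):
    # Imported here so the callback can be exercised without pika installed.
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.queue_declare(queue=queue, durable=True)
    channel.basic_qos(prefetch_count=1)
    # pika >= 1.0 signature; older releases take the callback as the first argument.
    channel.basic_consume(queue=queue, on_message_callback=make_callback(schedule))
    channel.start_consuming()
```

Since the consumer process never calls reactor.run(), there is nothing to stop or restart between messages; each crawl runs in a process managed by the daemon.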

Using Scrapy as a consumer of RabbitMQ
