Scrapy not parsing responses in a make_requests_from_url loop

Date: 2016-06-12 04:15:17

Tags: python scrapy

I'm trying to have Scrapy pull a URL from a message queue and then crawl that URL. I have the loop working and it grabs the URL from the queue, but once it has a URL it never enters the parse() method; it just keeps looping (and sometimes the URL comes back around even though I've deleted it from the queue...).

When it's running in the terminal, if I press CTRL+C and force it to end, it does enter the parse() method and crawls the page, then ends. I'm not sure what's wrong here.

import time

import boto.sqs
from scrapy import Spider

class my_Spider(Spider):
        name = "my_spider"
        allowed_domains = ['domain.com']

        def __init__(self):
            super(my_Spider, self).__init__()
            self.url = None

        def start_requests(self):
            while True:
                # Crawl the url from queue
                yield self.make_requests_from_url(self._pop_queue())

        def _pop_queue(self):
            # Grab the url from queue
            return self.queue()

        def queue(self):
            url = None
            while url is None:
                conf = {
                    "sqs-access-key": "",
                    "sqs-secret-key": "",
                    "sqs-queue-name": "crawler",
                    "sqs-region": "us-east-1",
                    "sqs-path": "sqssend"
                }
                # Connect to AWS
                conn = boto.sqs.connect_to_region(
                    conf.get('sqs-region'),
                    aws_access_key_id=conf.get('sqs-access-key'),
                    aws_secret_access_key=conf.get('sqs-secret-key')
                )
                q = conn.get_queue(conf.get('sqs-queue-name'))
                message = conn.receive_message(q)
                # Didn't get a message back, wait.
                if not message:
                    time.sleep(10)
                    url = None
                else:
                    url = message
            if url is not None:
                message = url[0]
                message_body = str(message.get_body())
                message.delete()
                self.url = message_body
                return self.url

        def parse(self, response):
            ...
            yield item

Update from the comments:

def start_requests(self):
    while True:
        # Crawl the url from queue
        queue = self._pop_queue()
        self.logger.error(queue)
        if queue is None:
            time.sleep(10)
            continue
        url = queue
        if url:
            yield self.make_requests_from_url(url)

I removed the while url is None: loop, but I'm still running into the same problem.

1 Answer:

Answer 0 (score: 2):

Am I right to assume that if this works:

import scrapy
import random

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]

    def __init__(self):
        super(ExampleSpider, self).__init__()
        self.url = None

    def start_requests(self):
        while True:
            # Crawl the url from queue
            yield self.make_requests_from_url(self._pop_queue())

    def _pop_queue(self):
        # Grab the url from queue
        return self.queue()

    def queue(self):
        return 'http://www.example.com/?{}'.format(random.randint(0,100000))

    def parse(self, response):
        print "Successfully parsed!"

then your code should also work, unless:

  • there's a problem with allowed_domains and your queue actually returns URLs outside of it
  • there's a problem with your queue() function and/or the data it produces, e.g. it returns arrays, or it blocks indefinitely, or something like that
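
To actually run that sanity check (a sketch; the file name is just my assumption), save the spider above as example_spider.py and start it standalone:

$ scrapy runspider example_spider.py

You should see "Successfully parsed!" printed for each response.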

Also note that the boto library is blocking, not Twisted/asynchronous. In order not to block Scrapy while using it, you would have to use a Twisted-compatible library like txsqs. Alternatively, you could run the boto calls in a separate thread with deferToThread.
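
Here is a minimal sketch of the deferToThread option (my own illustration, not the asker's code; fetch_url_from_sqs() is a hypothetical blocking helper standing in for the boto receive_message() call):

from scrapy import Spider
from twisted.internet.threads import deferToThread

class SketchSpider(Spider):
    name = "sketch"

    def fetch_url_from_sqs(self):
        # Hypothetical blocking helper: the boto receive_message() call would go here.
        return None

    def schedule_from_sqs(self):
        # Run the blocking call in Twisted's thread pool instead of the reactor thread.
        d = deferToThread(self.fetch_url_from_sqs)
        d.addCallback(self._schedule_request)
        return d

    def _schedule_request(self, url):
        if url:
            # Hand the new request to the already-running engine; parse() stays the default callback.
            self.crawler.engine.crawl(self.make_requests_from_url(url), self)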

After following up on your question on the Scrapy mailing list, I believe you have to understand that your code is quite far from functional, which makes this as much a generic Boto/SQS question as a Scrapy one. Anyway - here is a plain, functional solution.

I created an AWS SQS queue with these properties:

[screenshot: SQS queue properties]

Then gave it some overly broad permissions:

[screenshot: SQS queue permissions]
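
The screenshots don't reproduce here; roughly equivalent AWS CLI commands would be the following sketch (the queue name matches the spider config below, the account id is the same placeholder as in the send-message example, and granting "*" to that account is what "overly broad" amounts to):

$ aws --region eu-west-1 sqs create-queue --queue-name my_queue
$ aws --region eu-west-1 sqs add-permission --queue-url "https://sqs.eu-west-1.amazonaws.com/123412341234/my_queue" --label allow-everything --aws-account-ids 123412341234 --actions "*"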

Now I can submit messages to the queue with the AWS CLI like this:

$ aws --region eu-west-1 sqs send-message --queue-url "https://sqs.eu-west-1.amazonaws.com/123412341234/my_queue" --message-body 'url:https://stackoverflow.com'

For some strange reason - I think that when I set --message-body to a URL, it was actually downloading the page and sending the result as the message body(!). Not sure - I didn't have time to confirm this, but it's interesting. Anyway.

Here is proper spider code. As I said before, boto is a blocking API, which is bad. In this implementation I call its API only once, from start_requests(), and after that only when the spider is idle, in the spider_idle() callback. At that point, because the spider is idle, the fact that boto blocks doesn't cause much of a problem. While I'm pulling URLs from SQS, I pull as many as possible in the while loop (you could set a limit there if you don't want to consume everything at once), so that the blocking API gets called as rarely as possible. Note also that the call to conn.delete_message_batch() actually removes the messages from the queue (otherwise they just stay there forever), and that queue.set_message_class(boto.sqs.message.RawMessage) avoids this problem.

Overall, this is likely a proper solution for your level of requirements.

from scrapy import Spider, Request
from scrapy import signals
import boto.sqs

from scrapy.exceptions import DontCloseSpider

class CPU_Z(Spider):
    name = "cpuz"
    allowed_domains = ['valid.x86.fr']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CPU_Z, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def __init__(self, *args, **kwargs):
        super(CPU_Z, self).__init__(*args, **kwargs)

        conf = {
            "sqs-access-key": "AK????????????????",
            "sqs-secret-key": "AB????????????????????????????????",
            "sqs-queue-name": "my_queue",
            "sqs-region": "eu-west-1",
        }
        self.conn = boto.sqs.connect_to_region(
            conf.get('sqs-region'),
            aws_access_key_id=conf.get('sqs-access-key'),
            aws_secret_access_key=conf.get('sqs-secret-key')
        )
        self.queue = self.conn.get_queue(conf.get('sqs-queue-name'))
        assert self.queue
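        # RawMessage returns message bodies exactly as they were sent (no base64 decoding).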
        self.queue.set_message_class(boto.sqs.message.RawMessage)

    def _get_some_urs_from_sqs(self):
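        # Drain the queue: fetch messages in batches of up to 10, turn each 'url:' body into a Request, then delete the consumed batch.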
        while True:
            messages = self.conn.receive_message(self.queue, number_messages=10)

            if not messages:
                break

            for message in messages:
                body = message.get_body()
                if body[:4] == 'url:':
                    url = body[4:]
                    yield self.make_requests_from_url(url)

            self.conn.delete_message_batch(self.queue, messages)

    def spider_idle(self, spider):
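        # Runs on the spider_idle signal: hand freshly fetched requests to the engine and keep the spider from closing.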
        for request in self._get_some_urs_from_sqs():
            self.crawler.engine.crawl(request, self)

        raise DontCloseSpider()

    def start_requests(self):
        for request in self._get_some_urs_from_sqs():
            yield request

    def parse(self, response):
        yield {
            "freq_clock": response.url
        }
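
To try it end to end (a sketch: the spider name comes from the code above and the queue URL from the earlier send-message example; running it inside a Scrapy project and writing items.json are my assumptions), push a message whose body starts with url: and run the spider:

$ aws --region eu-west-1 sqs send-message --queue-url "https://sqs.eu-west-1.amazonaws.com/123412341234/my_queue" --message-body 'url:http://valid.x86.fr/'
$ scrapy crawl cpuz -o items.json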