I'm trying to get Scrapy to pull a URL from a message queue and then crawl that URL. The loop runs fine and pulls URLs from the queue, but once it has a URL it never enters the parse()
method; it just keeps looping (and sometimes the same URL comes back, even though I've deleted it from the queue...).
While it's running in the terminal, if I press CTRL+C and force it to stop, it enters the parse()
method, crawls the page, and then ends. I'm not sure what's going wrong here.
import time

import boto.sqs
from scrapy import Spider


class my_Spider(Spider):
    name = "my_spider"
    allowed_domains = ['domain.com']

    def __init__(self):
        super(my_Spider, self).__init__()
        self.url = None

    def start_requests(self):
        while True:
            # Crawl the url from queue
            yield self.make_requests_from_url(self._pop_queue())

    def _pop_queue(self):
        # Grab the url from queue
        return self.queue()

    def queue(self):
        url = None
        while url is None:
            conf = {
                "sqs-access-key": "",
                "sqs-secret-key": "",
                "sqs-queue-name": "crawler",
                "sqs-region": "us-east-1",
                "sqs-path": "sqssend"
            }
            # Connect to AWS
            conn = boto.sqs.connect_to_region(
                conf.get('sqs-region'),
                aws_access_key_id=conf.get('sqs-access-key'),
                aws_secret_access_key=conf.get('sqs-secret-key')
            )
            q = conn.get_queue(conf.get('sqs-queue-name'))
            message = conn.receive_message(q)
            # Didn't get a message back, wait.
            if not message:
                time.sleep(10)
                url = None
            else:
                url = message
        if url is not None:
            message = url[0]
            message_body = str(message.get_body())
            message.delete()
            self.url = message_body
            return self.url

    def parse(self, response):
        ...
        yield item
Update from the comments:
def start_requests(self):
    while True:
        # Crawl the url from queue
        queue = self._pop_queue()
        self.logger.error(queue)
        if queue is None:
            time.sleep(10)
            continue
        url = queue
        if url:
            yield self.make_requests_from_url(url)
I removed the while url is None: loop, but I'm still running into the same problem.
Answer 0 (score: 2):
I would assume that if this works:
import scrapy
import random


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]

    def __init__(self):
        super(ExampleSpider, self).__init__()
        self.url = None

    def start_requests(self):
        while True:
            # Crawl the url from queue
            yield self.make_requests_from_url(self._pop_queue())

    def _pop_queue(self):
        # Grab the url from queue
        return self.queue()

    def queue(self):
        return 'http://www.example.com/?{}'.format(random.randint(0, 100000))

    def parse(self, response):
        print "Successfully parsed!"
then your code should work as well, unless: there is a problem with allowed_domains and the URLs your queue actually returns fall outside of it; or there is a problem with your queue() function and/or the data it yields, e.g. it returns arrays, blocks indefinitely, or something similar. Also note that the boto library is blocking, not Twisted/asynchronous. To avoid blocking Scrapy while using it, you would have to use a Twisted-compatible library like txsqs. Alternatively, you might want to run the boto calls in a separate thread with deferToThread, as in the sketch below.
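For illustration, here is a minimal sketch (not part of the original answer) of what the deferToThread approach could look like, assuming an existing boto connection and queue object; the helper names are made up:

from twisted.internet.threads import deferToThread

def receive_from_sqs(conn, queue):
    # Blocking boto call; deferToThread runs it in a thread-pool thread,
    # so the Twisted reactor that Scrapy runs on is not blocked.
    return conn.receive_message(queue, number_messages=10)

def poll_queue_deferred(conn, queue):
    # Returns a Deferred that fires with the received messages
    # (possibly an empty list) once the worker thread finishes.
    d = deferToThread(receive_from_sqs, conn, queue)
    d.addCallback(lambda messages: messages or [])
    return d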
After following up on your question on the Scrapy list, I believe you have to understand that your code is far from functional, which makes this as much a general Boto/SQS question as a Scrapy one. Anyway - here is a plain, working solution.
I created an AWS SQS queue with these properties:
Then gave it some overly broad permissions:
Now I can submit messages to the queue with the AWS CLI like this:
$ aws --region eu-west-1 sqs send-message --queue-url "https://sqs.eu-west-1.amazonaws.com/123412341234/my_queue" --message-body 'url:https://stackoverflow.com'
For some strange reason - I think that when I set --message-body
to a URL directly, it was actually downloading the page and sending the result as the message body(!). Not sure - I didn't have time to confirm this, but interesting. Anyway.
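As a side note (not from the original answer), the same kind of message could also be submitted from Python with boto itself instead of the AWS CLI. A rough sketch, assuming the queue name and region used above and credentials picked up from the environment or boto config:

import boto.sqs
from boto.sqs.message import RawMessage

# Connect to the region and look up the queue created above.
conn = boto.sqs.connect_to_region('eu-west-1')
queue = conn.get_queue('my_queue')

# RawMessage sends the body as-is, matching the 'url:' prefix
# convention the spider below expects.
queue.set_message_class(RawMessage)
msg = RawMessage()
msg.set_body('url:https://stackoverflow.com')
queue.write(msg)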
Here is proper spider code. As I said before, boto
is a blocking API, which is bad. In this implementation I call its API only once, from start_requests(),
and after that only when the spider is idle, in the spider_idle()
callback. At that point, because the spider is idle, the fact that boto
blocks doesn't cause much of a problem. While pulling URLs from SQS, I pull as many as possible in the while
loop (you could set a limit there if you don't want to consume everything at once), so that the blocking API is called as rarely as possible. Also note that the call to conn.delete_message_batch()
actually removes the messages from the queue (otherwise they would just stay there forever), and that queue.set_message_class(boto.sqs.message.RawMessage)
avoids problems with how the message body gets decoded (the default Message class assumes base64-encoded bodies).
Overall, this is likely the right solution for your level of requirements.
from scrapy import Spider, Request
from scrapy import signals
from scrapy.exceptions import DontCloseSpider
import boto.sqs
import boto.sqs.message


class CPU_Z(Spider):
    name = "cpuz"
    allowed_domains = ['valid.x86.fr']  # domains only, no URL scheme

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CPU_Z, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def __init__(self, *args, **kwargs):
        super(CPU_Z, self).__init__(*args, **kwargs)
        conf = {
            "sqs-access-key": "AK????????????????",
            "sqs-secret-key": "AB????????????????????????????????",
            "sqs-queue-name": "my_queue",
            "sqs-region": "eu-west-1",
        }
        self.conn = boto.sqs.connect_to_region(
            conf.get('sqs-region'),
            aws_access_key_id=conf.get('sqs-access-key'),
            aws_secret_access_key=conf.get('sqs-secret-key')
        )
        self.queue = self.conn.get_queue(conf.get('sqs-queue-name'))
        assert self.queue
        # Use RawMessage so bodies are not base64-decoded on receipt.
        self.queue.set_message_class(boto.sqs.message.RawMessage)

    def _get_some_urs_from_sqs(self):
        # Drain as much of the queue as possible in one go, so the
        # blocking boto API is called as rarely as possible.
        while True:
            messages = self.conn.receive_message(self.queue, number_messages=10)
            if not messages:
                break
            for message in messages:
                body = message.get_body()
                if body[:4] == 'url:':
                    url = body[4:]
                    yield self.make_requests_from_url(url)
            # Actually remove the consumed messages from the queue.
            self.conn.delete_message_batch(self.queue, messages)

    def spider_idle(self, spider):
        # Called when the spider has nothing left to do: poll SQS again
        # and keep the spider alive instead of letting it close.
        for request in self._get_some_urs_from_sqs():
            self.crawler.engine.crawl(request, self)
        raise DontCloseSpider()

    def start_requests(self):
        for request in self._get_some_urs_from_sqs():
            yield request

    def parse(self, response):
        yield {
            "freq_clock": response.url
        }