Scrapy: close the spider if there are no urls to crawl

Time: 2017-07-17 12:09:38

Tags: python python-2.7 scrapy

I have a spider that gets its urls from a redis list.

I want to close the spider gracefully when no more urls are found. I tried raising the CloseSpider exception, but it never seems to reach that point:

def start_requests(self):
    while True:
        item = json.loads(self.__pop_queue())
        if not item:
            raise CloseSpider("Closing spider because no more urls to crawl")
        try:
            yield scrapy.http.Request(item['product_url'], meta={'item': item})
        except ValueError:
            continue

Even though I raise the CloseSpider exception, I still get the following error:

root@355e42916706:/scrapper# scrapy crawl general -a country=my -a log=file
2017-07-17 12:05:13 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/scrapper/scrapper/spiders/GeneralSpider.py", line 20, in start_requests
    item = json.loads(self.__pop_queue())
  File "/usr/local/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer

I also tried catching the TypeError in the same function, but that didn't work either.

Is there a recommended way to handle this?

Thanks

2 answers:

Answer 0 (score: 4)

You need to check whether self.__pop_queue() actually returns something before feeding it to json.loads() (or catch the TypeError when calling it), for example:

def start_requests(self):
    while True:
        item = self.__pop_queue()
        if not item:
            raise CloseSpider("Closing spider because no more urls to crawl")
        try:
            item = json.loads(item)
            yield scrapy.http.Request(item['product_url'], meta={'item': item})
        except (ValueError, TypeError):  # just in case the 'item' is not a string or buffer
            continue

    <div id="fb-root"></div> <script>(function(d, s, id) { var js, fjs = d.getElementsByTagName(s)[0]; if (d.getElementById(id)) return; js = d.createElement(s); js.id = id; js.src = "//connect.facebook.net/en_US/all.js#xfbml=1"; fjs.parentNode.insertBefore(js, fjs); }(document, 'script', 'facebook-jssdk'));</script>
<div class="fb-video" data-href="https://www.facebook.com/somevide/videos/778978567/" data-width="640" data-show-text="false"><div class="fb-xfbml-parse-ignore"></div></div>

Answer 1 (score: 1)

I ran into the same problem and found a small trick. When the spider is idle (i.e. when it has nothing to do), I check whether there is anything left in the redis queue. If not, I close the spider with close_spider. The following code sits in the spider class:

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    from_crawler = super(SerpSpider, cls).from_crawler
    spider = from_crawler(crawler, *args, **kwargs)
    # call self.idle every time the spider runs out of requests to process
    crawler.signals.connect(spider.idle, signal=scrapy.signals.spider_idle)
    return spider


def idle(self):
    # nothing left in the redis list, so shut the spider down cleanly
    if self.q.llen(self.redis_key) <= 0:
        self.crawler.engine.close_spider(self, reason='finished')
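
For reference, the idle handler above assumes the spider keeps a redis client in self.q and the list name in self.redis_key. A minimal sketch of that setup might look like the following; the connection details, attribute names, and the start_requests loop are assumptions for illustration, not part of the original answer:

import json

import redis
import scrapy


class SerpSpider(scrapy.Spider):
    name = 'serp'
    redis_key = 'serp:start_urls'  # assumed name of the redis list holding the queued items

    def __init__(self, *args, **kwargs):
        super(SerpSpider, self).__init__(*args, **kwargs)
        # plain redis client; in practice the host/port would come from settings
        self.q = redis.StrictRedis(host='localhost', port=6379, db=0)

    def start_requests(self):
        while True:
            raw = self.q.lpop(self.redis_key)
            if raw is None:
                return  # queue empty at startup; the idle handler covers later runs
            item = json.loads(raw)
            yield scrapy.http.Request(item['product_url'], meta={'item': item})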