Scrapy: stop an earlier parse function's loop when a condition is met

Date: 2016-02-06 17:56:06

Tags: python scrapy

I have a very specific situation in a scraper I am building. The first function, parse_posts_pages, iterates over all the pages of a given forum thread and, for each page, calls the second function, parse_posts.

def parse_posts_pages(self, response):
    thread_id = response.meta['thread_id']
    thread_link = response.meta['thread_link']
    thread_name = response.meta['thread_name']
    pages = 1  # fall back to a single page if the stats line is missing
    stats = response.xpath('//*[@id="postpagestats_above"]/text()').re(r'(\d+)')
    if len(stats) == 3:
        posts_per_page = int(stats[1])
        total_posts = int(stats[2])
        if posts_per_page > 0:
            # ceiling division: a partial last page counts as one more page
            pages = total_posts // posts_per_page
            if total_posts % posts_per_page > 0:
                pages += 1

    for page in range(pages, 0, -1):
        cur_page = '' if page == 1 else '/page' + str(page)
        post_page_link = thread_link + cur_page
        # yield (not return), so every page gets requested, newest first
        yield scrapy.Request(post_page_link, self.parse_posts,
                             meta={'thread_id': thread_id, 'thread_name': thread_name})


def parse_posts(self, response):
    global maxPostIDByThread, executeFullSpider
    thread_id = response.meta['thread_id']
    thread_name = response.meta['thread_name']
    for post in response.xpath('//*[@id="posts"]/li'):
        post_id = post.xpath('@id').re(r'(\d.*)')[0]
        if not executeFullSpider and post_id in maxPostIDByThread:
            break  # <- I need this break to also cancel the for loop in parse_posts_pages
        ...
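As an aside, the page-count computation in parse_posts_pages amounts to a ceiling division. A minimal standalone equivalent (plain Python, no scrapy; the function name is illustrative):

```python
import math

def page_count(total_posts, posts_per_page):
    """Number of thread pages, counting a partial last page as one."""
    if posts_per_page <= 0:
        return 1
    return math.ceil(total_posts / posts_per_page)

print(page_count(45, 20))  # prints 3: pages of 20 + 20 + 5 posts
```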

Inside the second function there is an if condition. When that condition evaluates to true, I need to break out of the current for loop and also out of the for loop in parse_posts_pages, because there is no need to keep paginating.

Is there any way to stop the for loop in the first function from within the second one?

2 Answers:

Answer 0 (score: 1)

Raise CloseSpider, as described in the manual:

"How can I instruct a spider to stop itself?

Raise the CloseSpider exception from a callback."

from scrapy.exceptions import CloseSpider

def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')

http://doc.scrapy.org/en/latest/faq.html#how-can-i-instruct-a-spider-to-stop-itself http://doc.scrapy.org/en/latest/topics/exceptions.html#scrapy.exceptions.CloseSpider

Note that requests which are still in progress (HTTP request already sent, response not yet received) will still be parsed. No new requests will be issued, though.

https://stackoverflow.com/a/23895143/5041915

Update: I actually found something interesting about stopping the spider from the main parse function: new valid requests may never start, because the exception is raised first.

I recommend checking the condition in the callback function and raising the exception as early as possible.
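Why raising once is enough to stop both loops can be mimicked without scrapy: the exception unwinds through whatever code is driving the callbacks, so no further pages are processed. A toy sketch (plain Python; StopCrawl stands in for CloseSpider, all names illustrative):

```python
class StopCrawl(Exception):
    """Stands in for scrapy.exceptions.CloseSpider in this toy model."""

def parse_posts(page, seen_ids):
    for post_id in page:
        if post_id in seen_ids:
            raise StopCrawl('reached known post %s' % post_id)
        yield post_id

def crawl(pages, seen_ids):
    scraped = []
    try:
        for page in pages:                            # outer loop (parse_posts_pages)
            for item in parse_posts(page, seen_ids):  # inner loop (parse_posts)
                scraped.append(item)
    except StopCrawl:
        pass  # the "engine" shuts down; both loops end at once
    return scraped

print(crawl([[5, 4], [3, 2], [1]], seen_ids={3}))  # prints [5, 4] -- stops at post 3
```

In real Scrapy the engine catches CloseSpider for you; the toy except block only models that shutdown.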

Answer 1 (score: 0)

Declare a parse_status flag on the spider, defaulting to False. If the required condition is met in the second function, set parse_status to True and break out of the loop in the first function:

def parse_posts_pages(self, response):
    thread_id = response.meta['thread_id']
    thread_link = response.meta['thread_link']
    thread_name = response.meta['thread_name']
    pages = 1  # fall back to a single page if the stats line is missing
    stats = response.xpath('//*[@id="postpagestats_above"]/text()').re(r'(\d+)')
    if len(stats) == 3:
        posts_per_page = int(stats[1])
        total_posts = int(stats[2])
        if posts_per_page > 0:
            pages = total_posts // posts_per_page
            if total_posts % posts_per_page > 0:
                pages += 1

    for page in range(pages, 0, -1):
        if self.parse_status:
            break
        cur_page = '' if page == 1 else '/page' + str(page)
        post_page_link = thread_link + cur_page
        yield scrapy.Request(post_page_link, self.parse_posts,
                             meta={'thread_id': thread_id, 'thread_name': thread_name})


def parse_posts(self, response):
    global maxPostIDByThread, executeFullSpider
    thread_id = response.meta['thread_id']
    thread_name = response.meta['thread_name']
    for post in response.xpath('//*[@id="posts"]/li'):
        post_id = post.xpath('@id').re(r'(\d.*)')[0]
        if not executeFullSpider and post_id in maxPostIDByThread:
            self.parse_status = True
            break  # <- also stops further pagination in parse_posts_pages
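Note that parse_status is used here as an instance attribute rather than a true global, so it should be initialized (e.g. parse_status = False in the spider's __init__) before crawling starts. The shared-flag pattern itself can be sketched without scrapy (toy code, all names illustrative):

```python
class ToySpider:
    """Mimics the shared-flag pattern; callbacks run synchronously here."""

    def __init__(self):
        self.parse_status = False  # must exist before parsing starts

    def parse_pages(self, pages, seen_ids):
        scraped = []
        for page in pages:
            if self.parse_status:  # checked before processing each page
                break
            scraped.extend(self.parse_posts(page, seen_ids))
        return scraped

    def parse_posts(self, page, seen_ids):
        items = []
        for post_id in page:
            if post_id in seen_ids:
                self.parse_status = True  # signal the outer loop
                break
            items.append(post_id)
        return items

print(ToySpider().parse_pages([[5, 4], [3, 2], [1]], seen_ids={3}))  # prints [5, 4]
```

In real Scrapy, callbacks run asynchronously, so the flag only skips pages whose requests have not yet been scheduled; requests already queued will still be fetched.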