How do I break out of a crawl in Scrapy when a certain condition is met?

Posted: 2018-11-13 14:51:56

Tags: python scrapy web-crawler

For the purposes of the application I am working on, I need Scrapy to break out of its current crawl and start crawling again from a specific, arbitrary URL.

The expected behavior is for the crawl to go back to a specific URL, which I can supply as an argument, whenever a certain condition is met.

I am using a CrawlSpider, but I don't know how to achieve this:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MyCrawlSpider(CrawlSpider):
    name = 'mycrawlspider'
    initial_url = ""

    def __init__(self, initial_url, *args, **kwargs):
        self.initial_url = initial_url
        domain = "mydomain.com"
        self.start_urls = [initial_url]
        self.allowed_domains = [domain]
        self.rules = (
            Rule(LinkExtractor(allow=[r"^https?://(www\.)?" + domain + "/.*"]),
                 callback='parse_item', follow=True),
        )

        # The base __init__ compiles self.rules, so it must run after the
        # rules are assigned.
        super(MyCrawlSpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        if some_condition:
            # force scrapy to go back to the home page and recrawl
            print("Should break out")
        else:
            print("Just carry on")

I tried putting

return scrapy.Request(self.initial_url, callback=self.parse_item)

in the branch where some_condition is True, but without success. Any help would be greatly appreciated; I have been struggling to work this out.
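
For context, Scrapy's duplicate filter normally drops requests for URLs it has already seen, which may be why the Request above gets ignored. Below is a minimal sketch (an assumption on my part, not a confirmed fix) of a request that bypasses the dupefilter and goes through the CrawlSpider's default parse so the link-extraction rules are applied again:

import scrapy


class MyCrawlSpider(CrawlSpider):
    # ... __init__ as above ...

    def parse_item(self, response):
        if some_condition:
            # dont_filter=True bypasses the duplicate filter; with no explicit
            # callback, the response is handled by CrawlSpider's built-in
            # parse(), so the rules run on the home page again.
            yield scrapy.Request(self.initial_url, dont_filter=True)
        else:
            print("Just carry on")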

1 Answer:

Answer 0 (score: 0)

You could set up a custom exception and handle it appropriately, something like this...

Feel free to edit this with syntax that actually works for CrawlSpider.

class RestartException(Exception):
    pass


class MyCrawlSpider(CrawlSpider):
    name = 'mycrawlspider'
    initial_url = ""

    def __init__(self, initial_url, *args, **kwargs):
        self.initial_url = initial_url
        domain = "mydomain.com"
        self.start_urls = [initial_url]
        self.allowed_domains = [domain]
        self.rules = (
            Rule(LinkExtractor(allow=[r"^https?://(www\.)?" + domain + "/.*"]),
                 callback='parse_item', follow=True),
        )

        # As above, the base __init__ compiles self.rules.
        super(MyCrawlSpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        if some_condition:
            print("Should break out")
            # Signal the code driving the crawl that it should restart.
            raise RestartException("We're restarting now")
        else:
            print("Just carry on")

siteName = "http://whatever.com"
crawler = MyCrawlSpider(siteName)           
while True:
    try:
        #idk how you start this thing, but do that

        crawler.run()
        break
    except RestartException as err:
        print(err.args)
        crawler.something = err.args
        continue

print("I'm done!")