Question

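I want to scrape an API. After 10 consecutive 404 pages, the spider should stop, because it has most likely reached the end of my list. At the same time, the spider has to be able to handle individual 404 pages, because some events (pks) have been deleted.

Currently, my counter starts at 0 again for every parsed URL. That is not what I want.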

import scrapy


class EventSpider(scrapy.Spider):
    # Let 404 responses reach parse() instead of being filtered out.
    handle_httpstatus_list = [404]  # TODO: Move to middleware?
    name = "eventpage"
    start_urls = ['https://www.eventwebsite.com/api-internal/v1/events/%s/?format=json' % page for page in range(1, 12000)]

    def parse(self, response):
        # Accept X 404 errors before processing stops.
        # Problem: this local variable is re-created on every call, so it never exceeds 1.
        count_404 = 0

        print("################", response.status, "################")
        if response.status == 404:
            count_404 += 1
            print("404 Counter: ", count_404)
        print("################################")
        if count_404 == 10:
            break  # Stop scraping (intent only: `break` is not valid outside a loop)
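
The counter resets because it is a local variable inside parse(). A minimal sketch of the counting logic on its own, assuming "consecutive" means that any non-404 response resets the streak: the class name ConsecutiveStatusTracker, its update() method, and the sample status sequence are hypothetical and not part of Scrapy.

```python
class ConsecutiveStatusTracker:
    """Counts how many times in a row a given HTTP status has been seen."""

    def __init__(self, status=404, limit=10):
        self.status = status
        self.limit = limit
        self.streak = 0  # stored on the object, so it survives between calls

    def update(self, observed_status):
        """Record one status; return True once the streak reaches the limit."""
        if observed_status == self.status:
            self.streak += 1
        else:
            self.streak = 0  # any other status breaks the streak
        return self.streak >= self.limit


# Hypothetical sequence: two isolated 404s (deleted events), then ten in a row.
tracker = ConsecutiveStatusTracker(status=404, limit=10)
statuses = [200, 404, 200, 404, 200] + [404] * 10 + [200]

stopped_at = None
for index, status in enumerate(statuses):
    if tracker.update(status):
        stopped_at = index
        break  # valid here because we are inside a loop

print(stopped_at)  # prints 14
```

In a spider, such a tracker would presumably live on the spider instance (for example as self.tracker) rather than inside parse(), and parse() could raise scrapy.exceptions.CloseSpider when update() returns True instead of using break. Scrapy downloads pages concurrently, so responses do not necessarily arrive in URL order; the streak is only strictly "consecutive" if concurrency is limited, for instance with CONCURRENT_REQUESTS set to 1.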

Scrapy: stop scraping after 10x 404

0 answers: