Question

I am building a Scrapy scraper that should crawl an entire domain, looking for broken EXTERNAL links.

I have the following:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class domainget(CrawlSpider):
    name = 'getdomains'
    allowed_domains = ['start.co.uk']
    start_urls = ['http://www.start.co.uk']

    rules = (
        Rule(LinkExtractor('/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            resp = scrapy.Request(link.url, callback=self.parse_ext)

    def parse_ext(self, response):
        self.logger.info('>>>>>>>>>> Reading: %s', response.url)

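When I run this code, it never reaches the parse_ext() function, where I want to get the HTTP status code and do further processing based on it.

As you can see, I pass parse_ext() as the callback when looping over the links extracted from the page in parse_item().

What am I doing wrong?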

Answer 1

You are not returning the Request instances from the callback:

def parse_item(self, response):
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        # yield hands each Request back to Scrapy so that it gets scheduled
        yield scrapy.Request(link.url, callback=self.parse_ext)

def parse_ext(self, response):
    self.logger.info('>>>>>>>>>> Reading: %s', response.url)
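
To see why the original version never reaches parse_ext(), here is a minimal sketch in plain Python that does not use Scrapy at all. FakeRequest and collect_requests are hypothetical stand-ins invented for this illustration (they are not Scrapy APIs); they only mimic the idea that a crawler can schedule nothing except what the callback returns or yields.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request; it only records a URL."""
    url: str


def collect_requests(
    callback: Callable[[List[str]], Optional[Iterable[FakeRequest]]],
    links: List[str],
) -> List[FakeRequest]:
    """Mimic a crawler engine: only what the callback hands back is scheduled."""
    result = callback(links)
    return list(result) if result is not None else []


def parse_item_broken(links):
    # Builds request objects but discards them, so the function returns None.
    for url in links:
        resp = FakeRequest(url)


def parse_item_fixed(links):
    # yield turns the callback into a generator that hands each request back.
    for url in links:
        yield FakeRequest(url)


links = ["http://example.com/a", "http://example.com/b"]
broken = collect_requests(parse_item_broken, links)
fixed = collect_requests(parse_item_fixed, links)
print(len(broken), len(fixed))  # prints: 0 2
```

Two further points may matter for this particular spider, depending on the Scrapy version and settings in use. Because allowed_domains is set, the offsite filtering is likely to drop requests to external hosts unless each Request is created with dont_filter=True. Also, by default only successful responses are passed to callbacks, so seeing a 404 in parse_ext() would probably require something like the handle_httpstatus_list spider attribute or an errback on the Request.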
