Question

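I have code that works (in most cases) for scraping an e-commerce website. I start from a URL, crawl the main categories, then go one layer deeper to crawl the subcategories, and repeat the same step until I reach the product pages.

It looks like this: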

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example_bot"  # the name used to run the bot
    start_urls = ["https://......html"]

    def parse(self, response):
        for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
            # go one layer deep from the landing page
            yield response.follow(link, callback=self.parse_on_categories)

    def parse_on_categories(self, response):
        for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
            # go two layers deep from the landing page
            yield response.follow(link, callback=self.parse_on_subcategories)

    def parse_on_subcategories(self, response):
        ...  # same code as above

    def parse_data(self, response):
        ...  # parse the product data

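I've noticed that for some parts of the website I have to go deeper into the subcategories before I can parse the products. Since I keep repeating the same code to crawl categories, I was wondering whether I could just reuse the first function until there are no more categories to crawl. This is what I tried: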

def parse(self, response):
    for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
        yield response.follow(link, callback=self.parse_on_categories)

def parse_on_categories(self, response):
    if response.css('div.mvNavSub ul li a::attr(href)').extract():  # if there are categories to crawl
        self.parse(response)
    else:
        self.parse_data(response)

def parse_data(self, response):
    ...  # parse the product data

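I want parse_on_categories to call the first function if there are categories to crawl. If there are none, it should call parse_data.

For now I can't get it to work, so I'd appreciate it if you could put me on the right track :) Thanks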

Answer 1

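You have to yield whatever you receive from the parse() and parse_data() methods. Calling self.parse(response) on its own only creates a generator object and then discards it, so Scrapy never sees the requests it would have produced. Scrapy only processes what parse_on_categories itself yields.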

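def parse_on_categories(self, response):
    # Keep crawling while the page still lists categories; otherwise treat it as a product page.
    if response.css('div.mvNavSub ul li a::attr(href)').extract():
        callback = self.parse
    else:
        callback = self.parse_data

    # Re-yield everything the chosen callback produces so Scrapy receives it.
    yield from callback(response)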
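
To see why the re-yield matters, here is a minimal sketch that runs without Scrapy. Everything in it is an assumption made for illustration: the `SITE` map, the `("request", page)` / `("item", page)` tuples and the `crawl` driver are hypothetical stand-ins, not Scrapy's API. The driver, like a crawler engine, only sees what a callback yields.

```python
from collections import deque

# Hypothetical site map: page -> category links (an empty list marks a product page).
SITE = {
    "home": ["cat-a", "cat-b"],
    "cat-a": ["sub-a1", "sub-a2"],
    "cat-b": [],        # product page one layer deep
    "sub-a1": [],       # product page two layers deep
    "sub-a2": ["leaf"],
    "leaf": [],         # product page three layers deep
}


def parse(page):
    """Generator callback: request every category link found on the page."""
    for link in SITE[page]:
        yield ("request", link)


def parse_data(page):
    """Generator callback: emit one scraped item for a product page."""
    yield ("item", page)


def broken_parse_on_categories(page):
    """The attempt from the question: generators are created but never consumed."""
    if SITE[page]:
        parse(page)       # builds a generator object and discards it
    else:
        parse_data(page)  # same problem, nothing reaches the caller


def fixed_parse_on_categories(page):
    """The fix from the answer: re-yield whatever the chosen callback produces."""
    callback = parse if SITE[page] else parse_data
    yield from callback(page)


def crawl(start, on_categories):
    """Toy stand-in for the engine: follows yielded requests, collects yielded items."""
    items = []
    queue = deque(parse(start))
    while queue:
        kind, page = queue.popleft()
        if kind == "item":
            items.append(page)
        else:
            # A callback that returns None contributes nothing to the crawl.
            queue.extend(on_categories(page) or ())
    return items


if __name__ == "__main__":
    print("broken:", crawl("home", broken_parse_on_categories))
    print("fixed: ", crawl("home", fixed_parse_on_categories))
```

With the broken callback the crawl stops after the first layer and collects nothing. With the fixed one, the same function handles every depth, and product pages at one, two and three layers are all reached.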

Scrapy: parsing multiple categories and subcategories with the same function
