Question

I am reading the Scrapy tutorial on its official website: https://doc.scrapy.org/en/latest/intro/tutorial.html

This is the code that confuses me:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

The key point is the following function, which is defined inside parse_author(self, response):

def extract_with_css(query):
    return response.css(query).extract_first().strip()

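As the tutorial says, the parse_author callback defines a helper function to extract and clean up the data from a CSS query. Can someone help me understand this? When does it get called?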
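To make the question concrete, here is a minimal sketch of the same nested-helper pattern without Scrapy. FakeResponse, lookup, extract and the calls list are hypothetical names invented for this illustration and are not part of Scrapy. As far as I can tell, the inner function is created each time the outer function runs, it can read the outer function's response argument, and it only runs at the places where it is explicitly called (here, once per dictionary entry):

```python
class FakeResponse:
    """Hypothetical stand-in for a Scrapy response; not a Scrapy class."""

    def __init__(self, data):
        self.data = data

    def lookup(self, query):
        # Plays the role of response.css(query).extract_first()
        return self.data[query]


def parse_author(response):
    calls = []  # records every helper invocation, for demonstration only

    def extract(query):
        # Closure: `response` comes from the enclosing parse_author call
        calls.append(query)
        return response.lookup(query).strip()

    # The helper runs here, once for each value in the dict
    item = {
        'name': extract('name'),
        'birthdate': extract('born'),
    }
    return item, calls


item, calls = parse_author(
    FakeResponse({'name': '  Albert Einstein ', 'born': ' March 14, 1879 '})
)
print(item)
print(calls)
```

If this reading is right, then in the tutorial code extract_with_css would be called three times per author page, while the dictionary passed to yield is being built.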

Confusion about the scrapy AuthorSpider example

0 Answers: