Question

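After the scrape finishes, I need to test all of the scraped data (for example, the percentage of items in which certain fields are filled in). The data is stored in a CSV file, so I decided to use Pandas for the tests. Is there any way to launch code that tests the .csv file from within the spider once Scrapy tells me that scraping has finished? I tried using an extension, but couldn't get it to work. Thanks.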

import scrapy


class Spider(scrapy.Spider):
    name = 'scrapyspider'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com/1/', 'https://www.example.com/2/']

    def parse(self, response):
        # Follow every product link to its detail page.
        for product_link in response.xpath(
                '//a[@class="product-link"]/@href').extract():
            absolute_url = response.urljoin(product_link)
            yield scrapy.Request(absolute_url, self.parse_product)
        # Follow category links and parse them with this same callback.
        for category_link in response.xpath(
                '//a[@class="navigation-item-link"]/@href').extract():
            absolute_url = response.urljoin(category_link)
            yield scrapy.Request(absolute_url, self.parse)

    def parse_product(self, response):
        ...
        yield item

Answer 1

Scrapy gives you a way to control the flow of items: Pipelines.

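In a pipeline you can validate each item or run any checks on it. If an item does not meet your criteria, or you want to update its data based on some attribute value, the pipeline is the place to do that.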
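As a rough illustration of that idea, here is a minimal sketch of a pipeline that counts how often each field is filled in and logs the percentages when the spider closes. The class name FieldAvailabilityPipeline and the field names in expected_fields are assumptions made up for this example, and it assumes items are dicts or scrapy.Item objects (anything with a .get method). It computes the availability figures from the items themselves rather than reading the CSV file back with Pandas.

```python
from collections import Counter


class FieldAvailabilityPipeline:
    """Counts how often each expected field is filled, then reports percentages."""

    # Hypothetical field names; replace them with the fields your items define.
    expected_fields = ("name", "price", "description")

    def open_spider(self, spider):
        # Called once when the spider starts: reset the counters.
        self.total = 0
        self.filled = Counter()

    def process_item(self, item, spider):
        # Called for every scraped item.
        self.total += 1
        for field in self.expected_fields:
            # Treat None and empty strings as missing values.
            if item.get(field) not in (None, ""):
                self.filled[field] += 1
        return item  # pass the item on unchanged

    def availability(self):
        """Return {field: percentage of items in which the field was filled}."""
        if not self.total:
            return {field: 0.0 for field in self.expected_fields}
        return {
            field: 100.0 * self.filled[field] / self.total
            for field in self.expected_fields
        }

    def close_spider(self, spider):
        # Called once when the spider finishes: report the results.
        for field, pct in self.availability().items():
            spider.logger.info("%s filled in %.1f%% of items", field, pct)
```

To try it, the pipeline would need to be enabled through the ITEM_PIPELINES setting, for example {"myproject.pipelines.FieldAvailabilityPipeline": 300}, where the module path myproject.pipelines is a placeholder for your own project.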

For more information about Pipelines, see the Item Pipeline section of the Scrapy documentation.
