Question

I'm trying to learn the Scrapy framework, and I'm able to write spiders, crawl the web, and so on. I can also save the data I need, but not in the way I want.

Example code:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    class ExampleSpider(CrawlSpider):
        name = 'examplecrawler'
        allowed_domains = ['example.com']
        start_urls = ['https://www.example.com/']
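        rules = [
            # The callback is not named "parse": the Scrapy docs advise against
            # using parse as a rule callback in a CrawlSpider.
            Rule(LinkExtractor(unique=True), follow=True, callback="parse_item")
        ]
    
        def parse_item(self, response):
            # Yield one item per crawled page, holding that page's URL.
            url = response.url
            yield {'link': url}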

Current result: the spider runs recursively, and the Item Exporters write the output only when I stop it with Control + C.

Desired result: the spider runs recursively and writes the output as it runs, without my having to stop it for the output to be written.

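I have read through the documentation and seen that I could write the data with something like a custom pipeline, but I would like to know whether this can be achieved with the current item exporters, i.e. csv and json.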

Answer 1

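To change how the current crawler works so that it prints out its status live, you have to modify the existing code of the base class or create a crawler yourself. Since you are importing an existing module, you cannot really change how it works, so the best option (if not the only one) is to create your own crawler with custom output.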
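As a rough illustration of "custom output", here is a minimal sketch of the custom-pipeline approach the question mentions. It is not necessarily what this answer had in mind. The class name `LiveJsonLinesPipeline` and the file name `items.jl` are made up for the example. The sketch assumes the spider yields plain dicts or Scrapy items, and it flushes after every item so the file grows while the crawl is still running.

```python
import json


class LiveJsonLinesPipeline:
    """Hypothetical pipeline: writes each item as one JSON line and flushes at once."""

    def __init__(self, path="items.jl"):
        # The output path is an assumption made for this sketch.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts; append so earlier runs are kept.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes or is interrupted cleanly.
        self.file.close()

    def process_item(self, item, spider):
        # Serialize the item as a single line, then flush Python's buffer
        # so the line is visible to other processes right away.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        self.file.flush()
        return item
```

To try it, the class would need to be listed in the project's `ITEM_PIPELINES` setting under whatever module path it lives in. The output file can then be watched with a tool such as `tail -f` while the spider runs.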

Scraping CrawlSpider output while crawling
