Question

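I am new to Scrapy, so please bear with me.

I have a spider that visits a page and downloads a file. In the end, I want to write the file's name, along with other useful information, into a db table.

I only want to write the information to the db table if the file was actually downloaded (and not 'uptodate').

--> Right now, I am struggling to find out whether the file was downloaded or was 'uptodate'.

If the file is downloaded, I see this in the log: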

2017-08-22 17:25:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
....,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'file_count': 1,
-->'file_status_count/downloaded': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 22, 16, 25, 16, 789000),
'item_scraped_count': 1,
'log_count/DEBUG': 8,
'log_count/INFO': 7,
'request_depth_max': 1,
....
2017-08-22 17:25:16 [scrapy.core.engine] INFO: Spider closed (finished)

If the file has already been downloaded, Scrapy does not download it again, and the log looks like this:

2017-08-22 17:32:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...,
'downloader/response_status_count/200': 4,
'file_count': 1,
-->'file_status_count/uptodate': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 22, 16, 32, 49, 787000),
'item_scraped_count': 1,
'log_count/DEBUG': 7,
'log_count/INFO': 7,
...
2017-08-22 17:32:49 [scrapy.core.engine] INFO: Spider closed (finished)

I would like to get hold of the download status.

I have looked at the Scrapy code, and I think the function I am after is 'inc_stats' in files.py in the pipelines folder:

def inc_stats(self, spider, status):
    spider.crawler.stats.inc_value('file_count', spider=spider)
    spider.crawler.stats.inc_value('file_status_count/%s' % status, spider=spider)

How can I get that information ('downloaded' or 'uptodate') out of the Scrapy code and into my spider?

Many thanks for your help.

Answer 1

You have to subclass FilesPipeline and override the inc_stats method in your own class.

You should have something like this in your settings.py:

ITEM_PIPELINES = {
    ...
    'scrapy.pipelines.files.FilesPipeline': 1
    ...
}

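This enables the default FilesPipeline that ships with Scrapy; you can create your own pipeline instead. Inside pipelines.py (or wherever you like), create a class like this: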

from scrapy.pipelines.files import FilesPipeline

class CustomFilesPipeline(FilesPipeline):
    def inc_stats(self, spider, status):
        super(CustomFilesPipeline, self).inc_stats(spider=spider, status=status)
        if status == 'downloaded':
            # do whatever you want here; 'pass' keeps the block valid Python
            pass

To enable this pipeline instead of Scrapy's, change settings.py to:

ITEM_PIPELINES = {
    ...
    'myproject.pipelines.CustomFilesPipeline': 1
    ...
}

Make sure myproject.pipelines.CustomFilesPipeline is the actual path to the pipeline class in your project.
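The counters that inc_stats increments are the same ones shown in the "Dumping Scrapy stats" output in the question, so they can also be read back as a plain dict. Here is a minimal sketch, assuming the dict comes from something like `crawler.stats.get_stats()`. The helper name `count_downloaded` is hypothetical, and the two sample dicts only mirror the keys from the logs above.

```python
def count_downloaded(stats):
    """Return how many files were actually fetched according to the stats dict.

    `stats` is assumed to be a dict shaped like the dumped Scrapy stats.
    The key is absent when no file was downloaded, so default to 0.
    """
    return stats.get("file_status_count/downloaded", 0)


# Sample data mirroring the two log excerpts in the question.
stats_after_download = {"file_count": 1, "file_status_count/downloaded": 1}
stats_when_uptodate = {"file_count": 1, "file_status_count/uptodate": 1}

print(count_downloaded(stats_after_download))  # 1
print(count_downloaded(stats_when_uptodate))   # 0
```

Note that this only gives totals for the whole crawl; it does not say which file each status belongs to.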

Answer 2

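You cannot get information about the file download in your spider code, because the download happens in the files pipeline, and therefore after your spider has processed the item.

However, you should be able to subclass the standard FilesPipeline and override the item_completed method. In that method, you can pick up some useful information from the results and info arguments and store it on your item before returning it. That way, the information will be available to other pipelines ordered after your files pipeline. I have not tested this approach, but I believe it will work.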
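A rough sketch of the filtering step such an item_completed override might do. It assumes `results` is the list of `(success, file_info)` pairs that FilesPipeline passes in, and that each file-info dict carries a `status` key (documented for Scrapy 2.2 and later, so older versions may not have it). The helper name `downloaded_files` and the sample data are made up for illustration.

```python
def downloaded_files(results):
    """Return the file-info dicts for files that were actually fetched.

    Each entry in `results` is assumed to be a (success, value) pair, where
    value is a file-info dict on success and a failure object otherwise.
    """
    return [
        file_info
        for ok, file_info in results
        if ok and file_info.get("status") == "downloaded"
    ]


# Hypothetical results for three files: fetched, already up to date, failed.
sample_results = [
    (True, {"url": "http://example.com/a.pdf", "path": "full/a.pdf",
            "checksum": "abc", "status": "downloaded"}),
    (True, {"url": "http://example.com/b.pdf", "path": "full/b.pdf",
            "checksum": "def", "status": "uptodate"}),
    (False, Exception("download failed")),
]

for file_info in downloaded_files(sample_results):
    print(file_info["path"])  # only full/a.pdf is printed
```

Inside a FilesPipeline subclass, `item_completed(self, results, item, info)` could call this helper and store the outcome on the item (in a field the item actually declares) before returning it, so that a later database pipeline writes a row only for freshly downloaded files.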

Scrapy: how to get the file download status
