How do I stop scrapy's ImagesPipeline from automatically creating the "full" folder?

Time: 2018-10-23 06:02:29

Tags: python scrapy

When downloading images with scrapy's ImagesPipeline, I have set a save path, but the pipeline still creates a new "full" folder inside that path. I don't want it to create that folder for me. How do I turn this off? I set the image storage path in scrapy's settings.py:

IMAGES_STORE = 'F:/test/exp'

When my crawler scrapes data, the images are saved under "F:/test/exp/full". I don't want my program to create this "full" folder; I want the images saved directly in the path I configured.
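For context, the built-in ImagesPipeline names each downloaded image after the SHA1 hash of its URL and always places it under a full/ subdirectory of IMAGES_STORE. A minimal settings.py sketch of the setup described above (the priority value 1 is arbitrary, and Pillow must be installed for the pipeline to run):

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'F:/test/exp'
# With this configuration, a downloaded image ends up at
# F:/test/exp/full/<sha1-of-url>.jpg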

1 Answer:

Answer 0 (score: 1)

Unfortunately, the full suffix of the path is hardcoded in the pipeline:

# from scrapy 1.5.1 source code
def file_path(self, request, response=None, info=None):
    ...
    image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
    return 'full/%s.jpg' % (image_guid)

You can get around this, however, by extending ImagesPipeline and creating the file myproject/pipelines.py:

from scrapy.pipelines.images import ImagesPipeline

class RootImagesPipeline(ImagesPipeline):
    """changes /full/ path to root"""

    def file_path(self, request, response=None, info=None):
        """This is the method used to determine the file path"""
        path = super().file_path(request, response, info)
        return path.replace('full/', '')

and activating it in settings.py instead of scrapy's built-in pipeline:
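A minimal sketch of that settings.py entry, assuming the project package is named myproject as in the file path above (the priority value 300 is arbitrary):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.RootImagesPipeline': 300,
}

With this registration the overridden file_path() is used, so the full/ prefix is stripped and images land directly in IMAGES_STORE.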