When downloading images with Scrapy's ImagesPipeline, I have already set the storage path, but the pipeline still creates a new "full" folder inside that path for me. I don't want it to create this folder. How can I turn it off? I set the image storage path in Scrapy's settings.py:

IMAGES_STORE = 'F:/test/exp'

When my crawler scrapes data, the images are saved under "F:/test/exp/full". I don't want my program to create this "full" folder; the images should be saved directly in the path I configured.
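For context, a minimal sketch of the settings that produce this behavior, assuming the built-in pipeline is enabled in the standard way (the question only shows IMAGES_STORE; the order value 1 is illustrative):

# settings.py -- a sketch; only IMAGES_STORE appears in the question
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'F:/test/exp'  # downloads end up under F:/test/exp/full/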
Answer 0 (score: 1)
Unfortunately, the "full" suffix of the path is hardcoded in the pipeline:
# from scrapy 1.5.1 source code
def file_path(self, request, response=None, info=None):
    ...
    image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
    return 'full/%s.jpg' % (image_guid)
However, you can work around this by extending ImagesPipeline and creating the file myproject/pipelines.py:
from scrapy.pipelines.images import ImagesPipeline

class RootImagesPipeline(ImagesPipeline):
    """Changes the /full/ path to the storage root."""
    def file_path(self, request, response=None, info=None):
        """This is the method used to determine the file path."""
        path = super().file_path(request, response, info)
        return path.replace('full/', '')
and activate it in settings.py instead of Scrapy's built-in pipeline:
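For example, a minimal sketch of that activation, assuming the project package is named myproject as above (ITEM_PIPELINES maps a pipeline's import path to its order value; 1 is just a conventional choice):

# settings.py -- assuming the project package is named "myproject"
ITEM_PIPELINES = {
    'myproject.pipelines.RootImagesPipeline': 1,
}
IMAGES_STORE = 'F:/test/exp'  # images now land directly here, with no "full" subfolder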