Question

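After scraping a website with Scrapy, I create a zip archive in the closing method and pull the images into it. Then I add a valid JSON file to the archive.

After unzipping (on Mac OS X or Ubuntu), the JSON file turns out to be damaged. The last item is missing.

End of the unzipped file: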

..a46.jpg"]},

Original file:

a46.jpg"]}]

Code:

import datetime
import os
import shutil
import zipfile

# create zip archive with all images inside
# (`name` is assumed to be defined earlier in the surrounding code)
filename = '../zip/' + datetime.datetime.now().strftime("%Y%m%d-%H%M") + '_' + name
imagefolder = 'full'
imagepath = '/Users/user/test_crawl/bid/images'
shutil.make_archive(
    filename,
    'zip',
    imagepath,
    imagefolder
)

# add json file to zip archive
filename_zip = filename + '.zip'
# parentheses keep the expression valid across the line break
path_to_file = ('/Users/user/test_crawl/bid/data/'
                + datetime.datetime.now().strftime("%Y%m%d") + '_' + name + '.json')
# context manager closes the archive; `archive` avoids shadowing the built-in zip()
with zipfile.ZipFile(filename_zip, 'a') as archive:
    archive.write(path_to_file, os.path.basename(path_to_file))

I can reproduce this error repeatedly; everything else works fine.

Answer 1

The solution is to use Scrapy's JsonItemExporter instead of the feed exporter, because the feed exporter only writes the file during close_spider(), which is too late for the zip step.
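To illustrate the timing problem, here is a minimal standard-library sketch. It does not use Scrapy at all: the open file handle is only a stand-in for an exporter that has not finished yet, and `zip_json_snapshot` and the file names are hypothetical. It assumes the damage comes from zipping the JSON file before the writer has written the last item and the closing bracket.

```python
import json
import os
import tempfile
import zipfile


def zip_json_snapshot(json_path, zip_path):
    """Zip whatever is on disk at json_path right now and return the archived text."""
    member = os.path.basename(json_path)
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.write(json_path, member)
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(member).decode("utf-8")


workdir = tempfile.mkdtemp()
json_path = os.path.join(workdir, "items.json")

# Stand-in for an exporter that is still running: the opening bracket and
# the first item are on disk, the last item and the closing bracket are not.
out = open(json_path, "w", encoding="utf-8")
out.write("[" + json.dumps({"images": ["a45.jpg"]}) + ",")
out.flush()
early = zip_json_snapshot(json_path, os.path.join(workdir, "early.zip"))

# The writer finishes only afterwards.
out.write(json.dumps({"images": ["a46.jpg"]}) + "]")
out.close()
late = zip_json_snapshot(json_path, os.path.join(workdir, "late.zip"))
```

Here `early` ends with a trailing comma, like the damaged file in the question, while `late` is complete, valid JSON. The zip step itself is not at fault; it just copies whatever has been written so far.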

The change itself is simple.

Import JsonItemExporter in pipelines.py:

from scrapy.exporters import JsonItemExporter

Change the pipeline like this:

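class MyPipeline(object):

    file = None

    def open_spider(self, spider):
        # JsonItemExporter writes bytes, so open the file in binary mode
        self.file = open('data/test.json', 'wb')
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        # finish and close the JSON file first, so it is complete on disk
        self.exporter.finish_exporting()
        self.file.close()
        # only then run the zip step (placeholder for the code from the question)
        cleanup('zip_method')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item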

zip_method contains the zip code from the question.
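For reference, the zip step could be wrapped in one function along these lines. This is only a sketch: `build_archive` and its parameter names are hypothetical and not part of the original answer, and it assumes the JSON file has already been finished and closed before it is called.

```python
import os
import shutil
import zipfile


def build_archive(archive_base, image_root, image_folder, json_path):
    """Zip image_root/image_folder, then append the finished JSON file.

    archive_base is the archive path without the '.zip' suffix.
    Returns the path of the archive that was written.
    """
    # make_archive returns the full name of the archive it created
    archive_path = shutil.make_archive(archive_base, "zip", image_root, image_folder)
    # mode 'a' appends to the existing archive instead of overwriting it
    with zipfile.ZipFile(archive_path, "a") as archive:
        archive.write(json_path, os.path.basename(json_path))
    return archive_path
```

Called from close_spider() after self.file.close(), this would pick up the complete JSON file, with the last item and the closing bracket included.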
