I'm creating a very large array. Rather than holding the whole array in memory, I'd like to write it out to a file, in a format I can import again later.
I would use pickle, but it looks like pickle is intended for writing a complete structure in one go.
In the example below, I need a way to treat the out variable as a file rather than an in-memory object:
out = []
for x in y:
    z = []
    # get lots of data into z
    out.append(z)
Answer 0 (score: 2)
streaming-pickle lets you save/load a sequence of Python data structures to disk in a streaming (incremental) way, using far less memory than regular pickle.
It is really just a single file containing three short functions. I've added a snippet with an example:
try:
    from cPickle import dumps, loads
except ImportError:
    from pickle import dumps, loads

def s_dump(iterable_to_pickle, file_obj):
    """ dump contents of an iterable iterable_to_pickle to file_obj, a file
    opened in write mode """
    for elt in iterable_to_pickle:
        s_dump_elt(elt, file_obj)

def s_dump_elt(elt_to_pickle, file_obj):
    """ dumps one element to file_obj, a file opened in write mode """
    pickled_elt_str = dumps(elt_to_pickle)
    file_obj.write(pickled_elt_str)
    # record separator is a blank line
    # (since pickled_elt_str might contain its own newlines)
    file_obj.write('\n\n')

def s_load(file_obj):
    """ load contents from file_obj, returning a generator that yields one
    element at a time """
    cur_elt = []
    for line in file_obj:
        cur_elt.append(line)
        if line == '\n':
            pickled_elt_str = ''.join(cur_elt)
            elt = loads(pickled_elt_str)
            cur_elt = []
            yield elt
Here is how you can use it:
from __future__ import print_function
import os

if __name__ == '__main__':
    if os.path.exists('obj.serialized'):
        # load the file 'obj.serialized' from disk and
        # spool through the iterable
        with open('obj.serialized', 'r') as handle:
            _generator = s_load(handle)
            for element in _generator:
                print(element)
    else:
        # or create it first, otherwise
        with open('obj.serialized', 'w') as handle:
            for i in xrange(100000):
                s_dump_elt({'i': i}, handle)
Answer 1 (score: 1)
Perhaps HDF5? It is fairly widely supported and lets you append to existing datasets.
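As a minimal sketch of that idea (assuming the h5py package; the file name, dataset name, and chunk sizes here are illustrative, not from the original answer), you create a dataset with an unbounded first axis and grow it as each chunk of data arrives:

```python
import h5py
import numpy as np

# Create a dataset that can grow along its first axis (maxshape=(None, 3)).
with h5py.File('big_array.h5', 'w') as f:
    dset = f.create_dataset('out', shape=(0, 3), maxshape=(None, 3),
                            dtype='f8', chunks=True)
    for _ in range(10):
        z = np.random.rand(100, 3)       # stand-in for "lots of data into z"
        n = dset.shape[0]
        dset.resize(n + len(z), axis=0)  # grow, then append in place
        dset[n:] = z

# Later: re-open and inspect without loading everything into memory at once.
with h5py.File('big_array.h5', 'r') as f:
    print(f['out'].shape)  # (1000, 3)
```

Because HDF5 stores the data chunked on disk, reading back a slice like `f['out'][:100]` only touches the chunks it needs.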
Answer 2 (score: 0)
I could imagine pickling each item to a string and prefixing it with a length indicator:
import os
import struct
import pickle  # or cPickle

def loader(inf):
    while True:
        s = inf.read(4)
        if not s:
            return
        length, = struct.unpack(">L", s)
        data = inf.read(length)
        yield pickle.loads(data)

if __name__ == '__main__':
    if os.path.exists('dumptest'):
        # load file
        with open('dumptest', 'rb') as inf:
            for element in loader(inf):
                print element
    else:
        # or create it first, otherwise
        with open('dumptest', 'wb') as outf:
            for i in xrange(100000):
                dump = pickle.dumps({'i': i}, protocol=-1)  # or whatever protocol you want
                lenstr = struct.pack(">L", len(dump))
                outf.write(lenstr + dump)
This buffers no more data than it actually needs, keeps the items cleanly separated from one another, and is compatible with all pickle protocols.
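The code above is Python 2. A sketch of the same length-prefixed technique in Python 3 (the function names here are my own, not from the original answer) is almost identical, but pickles are bytes, so the file must be opened in binary mode:

```python
import pickle
import struct

def dump_stream(items, outf):
    """Write each item as a 4-byte big-endian length prefix followed by its pickle."""
    for item in items:
        data = pickle.dumps(item, protocol=pickle.HIGHEST_PROTOCOL)
        outf.write(struct.pack(">L", len(data)))
        outf.write(data)

def load_stream(inf):
    """Yield items one at a time from a length-prefixed pickle stream."""
    while True:
        header = inf.read(4)
        if not header:
            return
        length, = struct.unpack(">L", header)
        yield pickle.loads(inf.read(length))
```

Since `load_stream` is a generator, iterating over a file of millions of records still only holds one unpickled item in memory at a time.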