Python: how to compress and save/load big data from/into memory

Time: 2018-02-05 17:49:58

Tags: python zlib

I have a huge dictionary with numpy arrays as values, and it consumes almost all of my RAM. It is not possible to pickle or compress it in one piece. I have looked at some solutions that use zlib to read/write in chunks with StringIO, but they work with files and the like, whereas I want to read/write from/into RAM.

Here is the closest example to what I want, but it only has the writing part. How can I read the object back after saving it this way, given that the chunks were written one after another and the compressed chunks of course have different lengths?

import zlib

class ZlibWrapper():
    # chunksize is used to save memory, otherwise a huge object will be copied
    def __init__(self, filename, chunksize=268435456): # 256 MB
        self.filename = filename
        self.chunksize = chunksize

    def save(self, data):
        """Saves a compressed object to disk"""
        mdata = memoryview(data)
        with open(self.filename, 'wb') as f:
            for i in range(0, len(mdata), self.chunksize):
                mychunk = zlib.compress(bytes(mdata[i:i+self.chunksize]))
                f.write(mychunk)

    def load(self):
        # ???
        return data

Unfortunately, the uncompressed objects are too large to be sent over the network, and compressing them externally would create additional complexity.

Pickle, unfortunately, starts to consume RAM and the system hangs.

Following a discussion with Charles Duffy, here is my serialization attempt (it does not work at the moment - it does not even compress the strings):

import zlib
import json
import numpy as np

mydict = {"a":np.array([1,2,3]),"b":np.array([4,5,6]),"c":np.array([0,0,0])}

#------------

# write to compressed stream ---------------------

def string_stream_serialization(dic):
    for key, val in dic.items():
        #key_encoded = key.encode("utf-8")  # is not json serializable
        yield json.dumps([key,val.tolist()])

output = ""
compressor = zlib.compressobj()
decompressor = zlib.decompressobj()

stream = string_stream_serialization(mydict)

with open("outfile.compressed", "wb") as f:
    for s in stream:
        if not s:
            f.write(compressor.flush())
            break
        f.write(compressor.compress(s.encode('utf-8'))) # .encode('utf-8') converts to bytes

# read from compressed stream: --------------------

def read_in_chunks(file_object, chunk_size=1024): # I set another chunk size intentionally
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

reconstructed = {}

with open("outfile.compressed", "rb") as f:
    for s in read_in_chunks(f):
        data = decompressor.decompress(decompressor.unconsumed_tail + s)
        while data:
            arr = json.loads(data.decode("utf-8"))
            reconstructed[arr[0]] = np.array(arr[1])
            data = decompressor.decompress(decompressor.unconsumed_tail)

print(reconstructed)



2 answers:

Answer 0 (score: 2):

Your first focus should be on having a sane way to serialize and deserialize your data. We have several constraints on your data, given in the question itself or in comments on it:

  • Your data consists of a dictionary with a very large number of key/value pairs
  • All keys are unicode strings
  • All values are numpy arrays which are individually short enough to fit comfortably in memory at any given time (even allowing multiple copies of any single value), although in aggregate the storage required becomes extremely large.

This suggests a fairly simple implementation:

import io
import struct

import numpy

def serialize(f, content):
    for k,v in content.items():
        # write length of key, followed by key as string
        k_bstr = k.encode('utf-8')
        f.write(struct.pack('L', len(k_bstr)))
        f.write(k_bstr)
        # write length of value, followed by value in numpy.save format
        memfile = io.BytesIO()
        numpy.save(memfile, v)
        f.write(struct.pack('L', memfile.tell()))
        f.write(memfile.getvalue())

def deserialize(f):
    retval = {}
    while True:
        content = f.read(struct.calcsize('L'))
        if not content: break
        k_len = struct.unpack('L', content)[0]
        k_bstr = f.read(k_len)
        k = k_bstr.decode('utf-8')
        v_len = struct.unpack('L', f.read(struct.calcsize('L')))[0]
        v_bytes = io.BytesIO(f.read(v_len))
        v = numpy.load(v_bytes)
        retval[k] = v
    return retval

As a quick test:

test_file = io.BytesIO()
serialize(test_file, {
    "First Key": numpy.array([123,234,345]),
    "Second Key": numpy.array([321,432,543]),
})

test_file.seek(0)
print(deserialize(test_file))

...so, that gets us that far - now, how do we add compression? Easily:

import gzip

with gzip.open('filename.gz', 'wb') as gzip_file:
    serialize(gzip_file, your_data)

...or, on the decompression side:

with gzip.open('filename.gz', 'rb') as gzip_file:
    your_data = deserialize(gzip_file)

This works because the gzip library already streams data out as it is requested, rather than compressing or decompressing everything at once. There is no need to do the windowing and chunking yourself - just leave it to the lower layer.
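To make that streaming behaviour concrete, here is a minimal sketch of reading the same file back piece by piece (the 64 KB read size is an arbitrary choice of mine, not something from the answer):

import gzip

# gzip_file.read(n) returns up to n bytes of *decompressed* data and only
# inflates as much of the underlying file as that requires, so memory use
# stays bounded no matter how large the archive is.
with gzip.open('filename.gz', 'rb') as gzip_file:
    while True:
        piece = gzip_file.read(64 * 1024)  # 64 KB of decompressed bytes per call
        if not piece:
            break
        # process `piece` here instead of holding the whole payload in RAM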

Answer 1 (score: 1):

To write the dictionary to disk, the zipfile module is a good fit (a small sketch follows the list below).

  • When saving - save each chunk as a file inside the zip archive.
  • When loading - iterate over the files in the zip archive and rebuild the data.
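The answer gives no code, so here is a minimal sketch of that idea, assuming each value is stored as its own .npy entry named after its key (the helper names save_dict/load_dict and the per-key .npy layout are my assumptions, not part of the answer):

import io
import zipfile

import numpy as np

def save_dict(path, data):
    # one deflate-compressed .npy entry per key; ZIP_DEFLATED uses zlib internally
    with zipfile.ZipFile(path, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
        for key, arr in data.items():
            buf = io.BytesIO()
            np.save(buf, arr)
            zf.writestr(key + '.npy', buf.getvalue())

def load_dict(path):
    # read the entries back one at a time, so only one array is decompressed at once
    result = {}
    with zipfile.ZipFile(path, 'r') as zf:
        for name in zf.namelist():
            with zf.open(name) as entry:
                result[name[:-len('.npy')]] = np.load(io.BytesIO(entry.read()))
    return result

Note that numpy's own numpy.savez_compressed produces essentially this layout (a zip archive of deflate-compressed .npy entries), so it may be worth trying before rolling your own.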