I have a huge dictionary with numpy arrays as values, and it consumes almost all of the RAM. There is no way to pickle or compress it in one piece. I have checked some solutions for reading/writing in chunks using zlib and StringIO, but they work with files, StringIO and so on, whereas I want to read/write from/into RAM.

Here is the closest example to what I want, but it only has the writing part. How can I read the object back after saving it this way, given that the chunks are written one after another and the compressed chunks of course have different lengths?

import zlib

class ZlibWrapper():
    # chunksize is used to save memory, otherwise the huge object will be copied at once
    def __init__(self, filename, chunksize=268435456):  # 256 MB
        self.filename = filename
        self.chunksize = chunksize

    def save(self, data):
        """Saves a compressed object to disk
        """
        mdata = memoryview(data)
        with open(self.filename, 'wb') as f:
            for i in range(0, len(mdata), self.chunksize):
                mychunk = zlib.compress(bytes(mdata[i:i+self.chunksize]))
                f.write(mychunk)

    def load(self):
        # ???
        return data

Below is another variant, which serializes the dictionary item by item as JSON into a zlib-compressed stream, together with an attempt to read it back:
import zlib
import json
import numpy as np

mydict = {"a": np.array([1,2,3]), "b": np.array([4,5,6]), "c": np.array([0,0,0])}

#------------
# write to compressed stream ---------------------
def string_stream_serialization(dic):
    for key, val in dic.items():
        #key_encoded = key.encode("utf-8") # is not json serializable
        yield json.dumps([key, val.tolist()])

output = ""
compressor = zlib.compressobj()
decompressor = zlib.decompressobj()

stream = string_stream_serialization(mydict)

with open("outfile.compressed", "wb") as f:
    for s in stream:
        if not s:
            f.write(compressor.flush())
            break
        f.write(compressor.compress(s.encode('utf-8')))  # .encode('utf-8') converts to bytes

# read from compressed stream: --------------------
def read_in_chunks(file_object, chunk_size=1024):  # I set another chunk size intentionally
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

reconstructed = {}
with open("outfile.compressed", "rb") as f:
    for s in read_in_chunks(f):
        data = decompressor.decompress(decompressor.unconsumed_tail + s)
        while data:
            arr = json.loads(data.decode("utf-8"))
            reconstructed[arr[0]] = np.array(arr[1])
            data = decompressor.decompress(decompressor.unconsumed_tail)

print(reconstructed)
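For illustration, a minimal sketch of one possible framing (an assumption on my part, not something I know to be the right answer): terminate every JSON record with a newline before compressing, call flush() at the end, and let the reader split the decompressed bytes on those newlines. The helper names write_records/read_records and the file name are made up for this sketch.

import zlib
import json
import numpy as np

def write_records(path, dic):
    # one zlib stream; each record is one newline-terminated JSON line (sketch)
    comp = zlib.compressobj()
    with open(path, "wb") as f:
        for key, val in dic.items():
            line = json.dumps([key, val.tolist()]) + "\n"
            f.write(comp.compress(line.encode("utf-8")))
        f.write(comp.flush())  # without flush() the tail of the stream is lost

def read_records(path, chunk_size=1024):
    # decompress chunk by chunk and cut records at the newlines (sketch)
    decomp = zlib.decompressobj()
    buf = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += decomp.decompress(chunk)
            while b"\n" in buf:
                line, buf = buf.split(b"\n", 1)
                key, lst = json.loads(line.decode("utf-8"))
                yield key, np.array(lst)
    buf += decomp.flush()
    if buf.strip():  # a last record without trailing newline, if any
        key, lst = json.loads(buf.decode("utf-8"))
        yield key, np.array(lst)

# usage: rebuild the dictionary, or iterate lazily instead
write_records("records.compressed", mydict)
reconstructed = dict(read_records("records.compressed"))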
Unfortunately, the uncompressed objects are too large to be sent over the network, and compressing them externally adds extra complexity.
Pickle, unfortunately, starts to consume RAM and the system hangs.
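A minimal sketch of a more incremental use of pickle, assuming per-item pickling is acceptable (the names dump_items/load_items are invented for this sketch): each key/value pair is pickled separately straight into a gzip file, so the full byte string is never built up in memory.

import gzip
import pickle

def dump_items(path, dic):
    # gzip compresses on the fly; pickle.dump is called once per item,
    # so only one value is serialized at a time (sketch)
    with gzip.open(path, "wb") as f:
        for key, val in dic.items():
            pickle.dump((key, val), f, protocol=pickle.HIGHEST_PROTOCOL)

def load_items(path):
    result = {}
    with gzip.open(path, "rb") as f:
        while True:
            try:
                key, val = pickle.load(f)
            except EOFError:  # end of the pickle stream
                break
            result[key] = val
    return result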
Following the discussion with Charles Duffy, I also made an attempt at serialization of my own, but it does not work at the moment; it does not even compress the strings.
Answer 0 (score: 2)
Your first focus should be on having a sane way to serialize and deserialize your data. The constraints given in the question itself and in the comments are that the object is a dictionary with a very large number of key/value pairs, the keys are strings, and each individual value is a numpy array small enough to fit in memory even though the collection as a whole is not. That suggests a fairly simple implementation:
import io
import struct
import numpy

def serialize(f, content):
    for k, v in content.items():
        # write length of key, followed by key as string
        k_bstr = k.encode('utf-8')
        f.write(struct.pack('L', len(k_bstr)))
        f.write(k_bstr)
        # write length of value, followed by value in numpy.save format
        memfile = io.BytesIO()
        numpy.save(memfile, v)
        f.write(struct.pack('L', memfile.tell()))
        f.write(memfile.getvalue())

def deserialize(f):
    retval = {}
    while True:
        content = f.read(struct.calcsize('L'))
        if not content:
            break
        k_len = struct.unpack('L', content)[0]
        k_bstr = f.read(k_len)
        k = k_bstr.decode('utf-8')
        v_len = struct.unpack('L', f.read(struct.calcsize('L')))[0]
        v_bytes = io.BytesIO(f.read(v_len))
        v = numpy.load(v_bytes)
        retval[k] = v
    return retval
As a simple test:

test_file = io.BytesIO()
serialize(test_file, {
    "First Key": numpy.array([123,234,345]),
    "Second Key": numpy.array([321,432,543]),
})
test_file.seek(0)
print(deserialize(test_file))
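If the round trip works, this should print something along the lines of {'First Key': array([123, 234, 345]), 'Second Key': array([321, 432, 543])}.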
...and so, there we have it. Now, how do we add compression? Easily enough:
import gzip

with gzip.open('filename.gz', 'wb') as gzip_file:
    serialize(gzip_file, your_data)
...or, on the decompression side:
with gzip.open('filename.gz', 'rb') as gzip_file:
    your_data = deserialize(gzip_file)
This works because the gzip library streams data in and out as it is requested, rather than compressing or decompressing everything at once. There is no need to do the windowing and chunking yourself; just leave it to the lower layer.
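If even the reconstructed dictionary is too large to hold at once, one possible variation, sketched below for the same wire format (deserialize_iter is a name invented for this sketch), is to deserialize lazily and yield one key/value pair at a time while gzip keeps streaming:

def deserialize_iter(f):
    # same wire format as deserialize(), but yields pairs instead of
    # building the whole dict in memory
    # (uses the io, struct, numpy and gzip imports shown above)
    while True:
        content = f.read(struct.calcsize('L'))
        if not content:
            break
        k_len = struct.unpack('L', content)[0]
        k = f.read(k_len).decode('utf-8')
        v_len = struct.unpack('L', f.read(struct.calcsize('L')))[0]
        v = numpy.load(io.BytesIO(f.read(v_len)))
        yield k, v

# usage: process values one by one without ever holding the full dictionary
with gzip.open('filename.gz', 'rb') as gzip_file:
    for key, value in deserialize_iter(gzip_file):
        print(key, value.shape)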
Answer 1 (score: 1)
To write a dictionary to disk, the zipfile module is a good fit.
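For instance, a minimal sketch along those lines, assuming every key can be used directly as an archive member name (otherwise the names would have to be escaped or mapped separately): each value is stored as its own deflate-compressed .npy member, so individual arrays can be written and read back one at a time.

import io
import zipfile
import numpy as np

def save_zip(path, dic):
    # one .npy member per key, compressed with deflate (sketch)
    with zipfile.ZipFile(path, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
        for key, val in dic.items():
            buf = io.BytesIO()
            np.save(buf, val)
            zf.writestr(key + '.npy', buf.getvalue())

def load_zip(path):
    # entries can also be read one at a time instead of all at once
    result = {}
    with zipfile.ZipFile(path, 'r') as zf:
        for name in zf.namelist():
            result[name[:-4]] = np.load(io.BytesIO(zf.read(name)))  # strip '.npy'
    return result

save_zip('mydict.zip', {"a": np.array([1, 2, 3]), "b": np.array([4, 5, 6])})
print(load_zip('mydict.zip'))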