The traditional way to save NumPy objects to Parquet is to use Pandas as an intermediary. However, I am working with large amounts of data that will not fit in Pandas without killing my environment, because in Pandas the data takes up a lot of RAM.
I need to save to Parquet because I am working with variable-length arrays in numpy, so Parquet files are actually smaller than .npy or .hdf5.
The code below is a minimal example that downloads a small piece of my data and converts between pandas objects and numpy objects to measure how much RAM they consume, and saves to npy and Parquet files to see how much disk space they take up.
# Download sample file, about 10 mbs
from sys import getsizeof
import requests
import pickle
import numpy as np
import pandas as pd
import os
def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)
    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)
    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
download_file_from_google_drive('1-0R28Yhdrq2QWQ-4MXHIZUdZG2WZK2qR', 'sample.pkl')
sampleDF = pd.read_pickle('sample.pkl')
sampleDF.to_parquet( 'test1.pqt', compression = 'brotli', index = False )
# Parquet file takes up little space
os.path.getsize('test1.pqt')
6594712
getsizeof(sampleDF)
22827172
sampleDF['totalCites2'] = sampleDF['totalCites2'].apply(lambda x: np.array(x))
#RAM reduced if the variable length batches are in numpy
getsizeof(sampleDF)
22401764
#Much less RAM as a numpy object
sampleNumpy = sampleDF.values
getsizeof(sampleNumpy)
112
# Much more space in .npy form
np.save( 'test2.npy', sampleNumpy)
os.path.getsize('test2.npy')
20825382
# Numpy savez. Not as good as parquet
np.savez_compressed( 'test3.npy', sampleNumpy )
os.path.getsize('test3.npy.npz')
9873964
Answer 0 (score: 6)
You can read/write numpy arrays to Parquet directly with Apache Arrow (pyarrow), which is also the underlying backend for Parquet in Pandas. Note that Parquet is a tabular format, so creating some kind of table is still necessary.
import numpy as np
import pyarrow as pa
np_arr = np.array([1.3, 4.22, -5], dtype=np.float32)
pa_table = pa.table({"data": np_arr})
pa.parquet.write_table(pa_table, "test.parquet")
Answer 1 (score: 1)
The Parquet format can be written using pyarrow; the correct import syntax is:

import pyarrow.parquet as pq

so that you can use pq.write_table. Otherwise, with only import pyarrow as pa, calling pa.parquet.write_table will return: AttributeError: module 'pyarrow' has no attribute 'parquet'.
Pyarrow requires the data to be organized by columns, which means that for a numpy multidimensional array you need to assign each column of the array to a specific field in the parquet table.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
ndarray = np.array(
    [
        [4.96266477e05, 4.55342071e06, -1.03240000e02, -3.70000000e01, 2.15592864e01],
        [4.96258372e05, 4.55344875e06, -1.03400000e02, -3.85000000e01, 2.40120775e01],
        [4.96249387e05, 4.55347732e06, -1.03330000e02, -3.47500000e01, 2.70718535e01],
    ]
)
ndarray_table = pa.table(
    {
        "X": ndarray[:, 0],
        "Y": ndarray[:, 1],
        "Z": ndarray[:, 2],
        "Amp": ndarray[:, 3],
        "Ang": ndarray[:, 4],
    }
)
pq.write_table(ndarray_table, "ndarray.parquet")