Question

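I have a very large data frame saved as a gzip file. The data also needs a good deal of manipulation before it is saved again.

One could convert the entire gzipped data frame to text, store it in a variable, parse/clean the data, and then save the result as a .csv file to load with pandas.read_csv(). However, this is extremely memory intensive.

I would like to read/decompress this file line by line (I think this would be the most memory-efficient solution), parse each line (e.g. with the regex module re, or perhaps a pandas solution), and then save each line into a pandas data frame.

Python has a gzip library: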

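import csv
import gzip
import pandas as pd

# open in text mode ('rt'); csv.reader needs str lines, not bytes
with gzip.open('filename.gzip', 'rt', newline='') as input_file:
    reader = csv.reader(input_file, delimiter="\t")
    data = [row for row in reader]
df = pd.DataFrame(data)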

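However, this seems to dump all of the information into the 'reader' variable and only then parse it. How can this be done in a more (memory) efficient way?

Should I be using a different library instead of gzip?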

Answer 1

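It's not quite clear what you want to do with your huge GZIP file. IIUC you can't read the whole data set into memory because the GZIP file is huge, so your only option is to process the data in chunks.

Assuming you want to read the data from the GZIP file, process it, and write it to a compressed HDF5 file: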

import pandas as pd

hdf_key = 'my_hdf_ID'
cols_to_index = ['colA','colZ'] # list of indexed columns, use `cols_to_index=True` if you want to index ALL columns
store = pd.HDFStore('/path/to/filename.h5')
chunksize = 10**5
# split columns on runs of whitespace; compression is inferred from the .gz extension
for chunk in pd.read_csv('filename.gz', sep=r'\s+', chunksize=chunksize):
    # process data in the `chunk` DF

    # don't index data columns in each iteration - we'll do it later
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False, complib='blosc', complevel=4)

# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()
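
The "process data in the `chunk` DF" step is left open above. As a rough sketch of what it might look like, the following cleans each chunk and keeps only a running aggregate, so that no more than one chunk is held in memory at a time. The column names `colA` and `value`, the tab separator, and the helper names `process_chunk` and `sum_by_key` are assumptions made for illustration, not part of the original answer.

```python
import gzip
import os
import tempfile

import pandas as pd


def process_chunk(chunk):
    """Hypothetical cleaning step: drop rows without a value, normalise colA."""
    chunk = chunk.dropna(subset=["value"]).copy()
    chunk["colA"] = chunk["colA"].str.strip().str.lower()
    return chunk


def sum_by_key(path, chunksize=10**5):
    """Stream a gzipped TSV in chunks and return the per-colA sum of `value`."""
    totals = None
    for chunk in pd.read_csv(path, sep="\t", compression="gzip", chunksize=chunksize):
        part = process_chunk(chunk).groupby("colA")["value"].sum()
        # merge this chunk's partial sums into the running totals
        totals = part if totals is None else totals.add(part, fill_value=0)
    return totals


# Tiny throwaway sample file, only for demonstration.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.tsv.gz")
with gzip.open(sample_path, "wt") as f:
    f.write("colA\tvalue\n a \t1\nB\t2\nA\t\nb\t3\n")

totals = sum_by_key(sample_path, chunksize=2)
```

Aggregating per chunk only works for operations that can be combined across chunks (sums, counts, min/max); anything that needs all rows at once would still have to go through an on-disk store such as the HDF5 file above.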

Answer 2

Perhaps extract the data with gunzip -c, pipe it to your Python script, and work with standard input there:

$ gunzip -c source.gz | python ./line_parser.py | gzip -c - > destination.gz

In the Python script line_parser.py:

#!/usr/bin/env python
import sys
for line in sys.stdin:
    sys.stdout.write(line)

Replace sys.stdout.write(line) with code that processes each line in your own custom way.
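
If a shell pipeline is not convenient, the same line-by-line idea can be sketched in pure Python. Iterating over a file opened with `gzip.open` in text mode yields one decompressed line at a time, so the whole file is never loaded at once. The `clean_line` function and its regex are hypothetical placeholders for the real parsing logic, and the file names simply mirror the ones in the shell command above.

```python
import gzip
import os
import re
import tempfile


def clean_line(line):
    """Hypothetical parser: collapse runs of whitespace into single tabs."""
    return re.sub(r"\s+", "\t", line.strip()) + "\n"


def transform_gzip(src, dst):
    """Read src line by line, clean each line, and write it to dst (gzipped)."""
    count = 0
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
         gzip.open(dst, "wt", encoding="utf-8") as fout:
        for line in fin:  # decompressed incrementally, one line per iteration
            fout.write(clean_line(line))
            count += 1
    return count


# Small demonstration with throwaway files.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "source.gz")
dst = os.path.join(tmp, "destination.gz")
with gzip.open(src, "wt", encoding="utf-8") as f:
    f.write("a   b  c\n1 2\t3\n")

n_lines = transform_gzip(src, dst)
with gzip.open(dst, "rt", encoding="utf-8") as f:
    result = f.read()
```

The cleaned file can then be loaded with pandas, in chunks if it is still too large to fit in memory.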

Answer 3

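Have you considered using HDFStore:

HDFStore is a dict-like object that reads and writes pandas objects in the high-performance HDF5 format, using the excellent PyTables library. See the cookbook for some advanced strategies.

Create the store, save the data frame, and close the store.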

# Note compression.
store = pd.HDFStore('my_store.h5', mode='w', complevel=9, complib='blosc')  # the keyword is `complevel`
with store:
    store['my_dataframe'] = df

Reopen the store, retrieve the data frame, and close the store.

with pd.HDFStore('my_store.h5', mode='r') as store:
    df = store.get('my_dataframe')

With a gzipped data frame, how do I read/decompress it line by line?
