How do I read a tsv file and store it as hdf5 without running out of memory?

Posted: 2015-06-24 13:23:46

Tags: python memory pandas hdf5 tsv

I have some datasets larger than 10 GB (in tsv format) and I need them in hdf5 format. I am using Python. I have read that the pandas package can read a file and store it as hdf5 without taking up much memory, but I cannot do this without my machine running out of memory. I have also tried Spark, but I don't feel comfortable with it. So, other than reading the whole file into memory, what alternative solutions do I have?
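The usual pattern is to stream the file in fixed-size chunks with pandas and append each chunk to an HDF5 table, so only one chunk is ever held in memory. A minimal sketch, assuming a tab-separated file at a hypothetical path data.tsv and a placeholder output file data.h5 (adjust the paths and chunk size to your setup):

import pandas as pd

# stream the tsv in chunks; only one chunk is in memory at a time
# 'data.tsv', 'data.h5' and chunksize are placeholders, not from the original post
with pd.HDFStore('data.h5', mode='w', complevel=9, complib='blosc') as store:
    for chunk in pd.read_csv('data.tsv', sep='\t', chunksize=100000):
        store.append('big_data', chunk)  # creates/extends a table-format dataset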

1 Answer:

Answer 0: (score: -1)

import pandas as pd
import numpy as np

# this example uses Python 3 (3.4 here)
# on Python 2.x, use 'from StringIO import StringIO' and StringIO(raw_tsv) instead
import io


# generate some 'large' tsv
raw_data = pd.DataFrame(np.random.randn(10000, 5), columns='A B C D E'.split())
raw_tsv = raw_data.to_csv(sep='\t') 
# read the tsv in chunks, 50 rows per chunk (adjust the chunk size to what your machine can handle)
# StringIO is used here only to provide an in-memory string buffer; you don't need it
# if you are reading from an actual file, just pass the file path instead
file_reader = pd.read_csv(filepath_or_buffer=io.StringIO(raw_tsv), sep='\t', chunksize=50)
# to peek at a chunk you could run:   list(file_reader)[0]
# each chunk is a DataFrame of exactly 50 rows, as shown below
# don't do this in your real processing: file_reader is a lazy iterator
# and can only be consumed once

    Unnamed: 0       A       B       C       D       E
0            0 -1.2553  0.1386  0.6201  0.1014 -0.4067
1            1 -1.0127 -0.8122 -0.0850 -0.1887 -0.9169
2            2  0.5512  0.7816  0.0729 -1.1310 -0.8213
3            3  0.1159  1.1608 -0.4519 -2.1344  0.1520
4            4 -0.5375 -0.6034  0.7518 -0.8381  0.3100
5            5  0.5895  0.5698 -0.9438  3.4536  0.5415
6            6 -1.2809  0.5412  0.5298 -0.8242  1.8116
7            7  0.7242 -1.6750  1.0408 -0.1195  0.6617
8            8 -1.4313 -0.4498 -1.6069 -0.7309 -1.1688
9            9 -0.3073  0.3158  0.6478 -0.6361 -0.7203
..         ...     ...     ...     ...     ...     ...
40          40 -0.3143 -1.9459  0.0877 -0.0310 -2.3967
41          41 -0.8487  0.1104  1.2564  1.0890  0.6501
42          42  1.6665 -0.0094 -0.0889  1.3877  0.7752
43          43  0.9872 -1.5167  0.0059  0.4917  1.8728
44          44  0.4096 -1.2913  1.7731  0.3443  1.0094
45          45 -0.2633  1.8474 -1.0781 -1.4475 -0.2212
46          46 -0.2872 -0.0600  0.0958 -0.2526  0.1531
47          47 -0.7517 -0.1358 -0.5520 -1.0533 -1.0962
48          48  0.8421 -0.8751  0.5380  0.7147  1.0812
49          49 -0.8216  1.0702  0.8911  0.5189 -0.1725

[50 rows x 6 columns]

# set up your HDF5 store with the highest compression level (9)
h5_file = pd.HDFStore('your_hdf5_file.h5', complevel=9, complib='blosc')

h5_file
Out[18]: 
<class 'pandas.io.pytables.HDFStore'>
File path: your_hdf5_file.h5
Empty


# now, start processing
for df_chunk in file_reader:
    # use the append method so each chunk is added to the same 'big_data' table
    h5_file.append('big_data', df_chunk, complevel=9, complib='blosc')

# after processing, close hdf5 file
h5_file.close()


# check your hdf5 file
pd.HDFStore('your_hdf5_file.h5')
# it now has all 10,000 rows, written chunk by chunk

Out[21]: 
<class 'pandas.io.pytables.HDFStore'>
File path: your_hdf5_file.h5
/big_data            frame_table  (typ->appendable,nrows->10000,ncols->6,indexers->[index])
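
If the resulting HDF5 file is itself too large to load at once, it can also be read back chunk by chunk. A minimal sketch, assuming the 'big_data' key created above:

h5_file = pd.HDFStore('your_hdf5_file.h5', mode='r')
# select() on a table-format dataset can return an iterator of DataFrames
for df_chunk in h5_file.select('big_data', chunksize=1000):
    print(df_chunk.shape)  # process each 1000-row chunk here
h5_file.close()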