Quickly concatenating a large number of homogeneous DataFrames

Time: 2015-07-09 02:26:06

Tags: python pandas

I have about 7,000 homogeneous DataFrames (same columns but different sizes) and want to concatenate them into one big DataFrame for further analysis.

If I generate them all and store them in a list, memory blows up, so I cannot use pandas.concat([...all my tables...]) and instead do the following:

big_table = None
for table in readTables():
    big_table = pandas.concat([big_table, table], ignore_index=True)

I would like to know how the efficiency of the for-loop approach compares to the pandas.concat([...all tables...]) approach. Are they equally fast?

Since the tables are homogeneous and the index does not matter, is there any trick to speed up the concatenation?
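
For reference, the two approaches are not equally fast: pd.concat inside a loop re-copies the entire accumulated frame on every iteration, so concatenating n tables costs O(n²) row copies, while a single pd.concat over a list copies each row into the result only once. Below is a minimal sketch of the contrast, using a stand-in generator since the question's readTables() is not shown:

import numpy as np
import pandas as pd

def readTables(n=100, rows=1000, cols=10):
    # stand-in for the question's generator: yields n homogeneous frames
    for _ in range(n):
        yield pd.DataFrame(np.random.randn(rows, cols))

# quadratic: each iteration re-copies everything accumulated so far
big_table = pd.DataFrame()
for table in readTables():
    big_table = pd.concat([big_table, table], ignore_index=True)

# linear: each row is copied into the result once; the trade-off is
# holding all the small tables in memory at the same time
big_table = pd.concat(list(readTables()), ignore_index=True)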

1 Answer:

Answer 0 (score: 1)

Here is an example of using pd.HDFStore to append multiple tables together.

import pandas as pd
import numpy as np
from time import time

# your tables
# =========================================
columns = ['col{}'.format(i) for i in range(100)]
data = np.random.randn(100000).reshape(1000, 100)
df = pd.DataFrame(data, columns=columns)

# many tables, generator
def get_generator(df, n=1000):
    for x in range(n):
        yield df

table_reader = get_generator(df, n=1000)


# processing
# =========================================
# create an HDF5 store, compression level 5 (levels 1-9, 9 is maximum)
h5_file = pd.HDFStore('/home/Jian/Downloads/my_hdf5_file.h5', complevel=5, complib='blosc')

Out[2]: 
<class 'pandas.io.pytables.HDFStore'>
File path: /home/Jian/Downloads/my_hdf5_file.h5
Empty


t0 = time()

# loop over your frames and append each one to the store
for counter, frame in enumerate(table_reader, start=1):
    print('Appending Table {}'.format(counter))
    h5_file.append('big_table', frame, complevel=5, complib='blosc')

t1 = time()

# Appending Table 1
# Appending Table 2
# ...
# Appending Table 999
# Appending Table 1000


print(t1-t0)

Out[3]: 41.6630880833

# check our hdf5_file
h5_file

Out[7]: 
<class 'pandas.io.pytables.HDFStore'>
File path: /home/Jian/Downloads/my_hdf5_file.h5
/big_table            frame_table  (typ->appendable,nrows->1000000,ncols->100,indexers->[index])

# close hdf5
h5_file.close()

# very fast to retrieve your data in any future IPython session

h5_file = pd.HDFStore('/home/Jian/Downloads/my_hdf5_file.h5')

%time my_big_table = h5_file['big_table']

CPU times: user 217 ms, sys: 1.11 s, total: 1.33 s
Wall time: 1.89 s
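
Because append writes the data in PyTables' table format (frame_table above), a later session also does not have to load all 1,000,000 rows at once; partial reads are supported. A short sketch, assuming the same file path as above:

import pandas as pd

h5_file = pd.HDFStore('/home/Jian/Downloads/my_hdf5_file.h5')

# table-format stores support partial reads: fetch only the first
# 10,000 rows instead of materializing the whole frame
subset = h5_file.select('big_table', start=0, stop=10000)

h5_file.close()

pd.read_hdf('/home/Jian/Downloads/my_hdf5_file.h5', 'big_table') is an equivalent one-shot read that opens and closes the store for you.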