I have about 7,000 DataFrames of the same kind (identical columns but different sizes) and want to concatenate them into one big DataFrame for further analysis. If I generate them all and store them in a list, memory blows up, so I cannot use pandas.concat([...all my tables...]) and instead do the following:
big_table = pandas.DataFrame()  # start from an empty frame; pd.concat cannot take None
for table in readTables():
    big_table = pandas.concat([big_table, table], ignore_index=True)
I would like to know how efficient the for-loop approach is compared with a single pandas.concat([...all tables...]) call. Are they the same speed? Since the tables are homogeneous and the index does not matter, are there any tricks to speed up the concatenation?
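To make the comparison concrete, this is the kind of micro-benchmark I have in mind (make_tables below is only a stand-in for my readTables(), and the table counts and sizes are arbitrary):
import pandas as pd
import numpy as np
from time import time

def make_tables(n=200, rows=1000, cols=20):
    # stand-in for readTables(): homogeneous frames with identical columns
    for i in range(n):
        yield pd.DataFrame(np.random.randn(rows, cols))

# loop version: re-concat the accumulated frame with each new table
t0 = time()
big = pd.DataFrame()
for table in make_tables():
    big = pd.concat([big, table], ignore_index=True)
print('loop concat: {:.2f}s'.format(time() - t0))

# single-call version: concat all tables at once (needs them all in memory)
t0 = time()
big = pd.concat(list(make_tables()), ignore_index=True)
print('single concat: {:.2f}s'.format(time() - t0))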
Answer 0 (score: 1)
Here is an example of appending multiple tables together with pd.HDFStore.
import pandas as pd
import numpy as np
from time import time
# your tables
# =========================================
columns = ['col{}'.format(i) for i in range(100)]
data = np.random.randn(100000).reshape(1000, 100)
df = pd.DataFrame(data, columns=columns)
# simulate many tables with a generator
def get_generator(df, n=1000):
    for x in range(n):
        yield df

table_reader = get_generator(df, n=1000)
# processing
# =========================================
# create an HDF5 store, compression level 5 (scale is 1-9; 9 is strongest)
h5_file = pd.HDFStore('/home/Jian/Downloads/my_hdf5_file.h5', complevel=5, complib='blosc')
# the freshly created store is empty:
# <class 'pandas.io.pytables.HDFStore'>
# File path: /home/Jian/Downloads/my_hdf5_file.h5
# Empty
t0 = time()
# loop over your tables, appending each one to the same on-disk table
counter = 1
for frame in table_reader:
    print('Appending Table {}'.format(counter))
    h5_file.append('big_table', frame, complevel=5, complib='blosc')
    counter += 1
t1 = time()
# Appending Table 1
# Appending Table 2
# ...
# Appending Table 999
# Appending Table 1000
print(t1 - t0)
# 41.6630880833
# check our hdf5 file: one appendable table holding all 1,000,000 rows
h5_file
# Out[7]:
# <class 'pandas.io.pytables.HDFStore'>
# File path: /home/Jian/Downloads/my_hdf5_file.h5
# /big_table frame_table (typ->appendable,nrows->1000000,ncols->100,indexers->[index])
# close hdf5
h5_file.close()
# very fast to retrieve your data in any future IPython session
h5_file = pd.HDFStore('/home/Jian/Downloads/my_hdf5_file.h5')
%time my_big_table = h5_file['big_table']
# CPU times: user 217 ms, sys: 1.11 s, total: 1.33 s
# Wall time: 1.89 s
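Since the data now lives in an appendable on-disk table, it does not have to come back in one piece either. A sketch, assuming the store from above (the row counts and the per-chunk process() call are illustrative, not a fixed API):
# read back only a slice of rows
part = h5_file.select('big_table', start=0, stop=100000)

# or stream the table in chunks for out-of-core processing
for chunk in h5_file.select('big_table', chunksize=100000):
    process(chunk)  # hypothetical per-chunk work

h5_file.close()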