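I am requesting data in a loop and inserting each result into a pandas DataFrame, but the process seems too slow. I could use NumPy arrays, but the data has many gaps, so each value has to be placed at the correct index.
The fastest solution I have found so far is to merge the data in "chunks". The update method (which I may be using incorrectly) is the slowest of all, so it is commented out here.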
import pandas as pd
import random
import time

index_size = 3000

def get_data():
    # Simulates one request: values_size entries at random, sorted row labels.
    values_size = 100
    index = random.sample(range(index_size), values_size)
    values = [random.uniform(0, 1) for i in range(0, values_size)]
    index.sort()
    # Note: the random values are passed as column labels here, so the
    # resulting frame has values_size columns and no cell data.
    return pd.DataFrame(index=index, columns=values)

nb_of_columns = 500

def fill_merge():
    # Left-merge every new piece directly into the full frame.
    df = pd.DataFrame(index=range(0, index_size))
    for i in range(0, nb_of_columns):
        data = get_data()
        df = df.merge(data, how='left', left_index=True, right_index=True)

def fill_update():
    # Preallocate the columns, then fill them with DataFrame.update.
    df = pd.DataFrame(index=range(0, index_size),
                      columns=[str(i) for i in range(0, index_size)])
    for i in range(0, nb_of_columns):
        data = get_data()
        data.columns = [str(i)]
        df.update(data, join='left')

def fill_chunk():
    # Merge pieces into a small intermediate frame, and merge that frame
    # into the full one whenever it exceeds chunk_size columns.
    chunk_size = 100
    df = pd.DataFrame(index=range(0, index_size))
    df_chunk = pd.DataFrame(index=range(0, index_size))
    chunk = 0
    for i in range(0, nb_of_columns):
        data = get_data()
        df_chunk = df_chunk.merge(data, how='left', left_index=True, right_index=True)
        chunk += 1
        if len(df_chunk.columns) > chunk_size:
            df = df.merge(df_chunk, how='left', left_index=True, right_index=True)
            df_chunk = pd.DataFrame(index=range(0, index_size))
            chunk = 0
    # Merge whatever is left in the last, partially filled chunk.
    df = df.merge(df_chunk, how='left', left_index=True, right_index=True)

# Time the direct merge.
t_start = time.time()
fill_merge()
t_end = time.time()
print(t_end - t_start)

# Time the update approach (commented out because it is the slowest).
t_start = time.time()
#fill_update()
t_end = time.time()
print(t_end - t_start)

# Time the chunked merge.
t_start = time.time()
fill_chunk()
t_end = time.time()
print(t_end - t_start)
The direct merge approach takes about 160 s, while the chunked merge takes about 85 s.
Is there a faster way?
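For comparison, here is a minimal sketch of the NumPy idea mentioned at the top: preallocate one NaN-filled array and write each piece at its row positions. It assumes that every request yields a single column of values and that the row labels are the integers 0 to index_size - 1, so a label can be used directly as an array position. The names `get_series`, `fill_numpy` and the `seed` parameter are hypothetical and only serve the illustration. I have not timed it against the merge variants.

```python
import random

import numpy as np
import pandas as pd

index_size = 3000
nb_of_columns = 500
values_size = 100


def get_series(rng):
    """Hypothetical stand-in for one request: sparse values at sorted row labels."""
    index = sorted(rng.sample(range(index_size), values_size))
    values = [rng.uniform(0, 1) for _ in range(values_size)]
    return pd.Series(values, index=index)


def fill_numpy(seed=0):
    rng = random.Random(seed)
    # Allocate the whole result once; NaN marks the gaps in the data.
    out = np.full((index_size, nb_of_columns), np.nan)
    for col in range(nb_of_columns):
        data = get_series(rng)
        # Assumption: row labels are 0..index_size-1, so they double as positions.
        out[data.index.to_numpy(), col] = data.to_numpy()
    # Wrap the filled array in a DataFrame only once, at the end.
    return pd.DataFrame(out, index=range(index_size))


df = fill_numpy()
```

If the row labels were not plain integer positions, they would first have to be mapped to positions, for example with `Index.get_indexer` on the target index.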