Copying pandas data faster under certain conditions

Time: 2019-08-26 22:05:30

Tags: python pandas for-loop bigdata

I have a dataframe (df_main) into which I want to copy data by looking up the required columns in another dataframe (df_data).

df_data
   name  Index     par_1   par_2 ... par_n
0    A1      1        a0      b0
1    A1      2        a1
2    A1      3        a2
3    A1      4        a3 
4    A2      2        a4
...    

df_main
   name Index_0  Index_1    
0    A1       1        2
1    A1       1        3
2    A1       1        4
3    A1       2        3 
4    A1       2        4
5    A1       3        4
...

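I want to copy the parameter columns from df_data to df_main, so that each row of df_main receives all the parameters of the df_data row with the same name and index. I implemented this with for loops as shown below, but it is far too slow to be usable: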

import tqdm

def data_copy(df, df_data, indice):
    '''Copy parameter columns from df_data into df.

    indice: which column of df is matched against df_data['Index']
    ('Index_0' or 'Index_1').
    '''
    # Parameter columns to copy: everything except the two key columns
    # (assumed; the original post did not show how `columns` was defined)
    columns = [c for c in df_data.columns if c not in ('name', 'Index')]
    # Loop over every distinct name in the dataset
    for name in tqdm.tqdm(df['name'].unique()):
        # Unique index values for this specific name
        indexes = df.loc[df['name'] == name, indice].unique()
        for index in indexes:
            # Rows of df_data with this name and index (expected: exactly one)
            data = df_data[(df_data['Index'] == index) & (df_data['name'] == name)]
            for col in columns:
                # Write that row's value into every matching row of df
                df.loc[(df[indice] == index) & (df['name'] == name), col] = data[col].item()
    return df

df_main = data_copy(df_main, df_data, 'Index_0') 

This gives me what I need:

df_main
   name Index_0  Index_1   par_1    par_2 ...
0    A1       1        2      a0
1    A1       1        3      a0    
2    A1       1        4      a0
3    A1       2        3      a1
4    A1       2        4      a1
5    A1       3        4      a2

However, running it on very large data takes a long time. What is the best way to avoid the for loops and get a faster implementation?

1 answer:

Answer 0 (score: 0)

For each dataframe, you can create a new column that concatenates the name and the index. See below:

import pandas as pd

df1 = {'name':['A1','A1'],'index':['1','2'],'par_1':['a0','a1']}
df1 = pd.DataFrame(data=df1)
df1['new'] = df1['name'] + df1['index'] 
df1

df2 = {'name':['A1','A1'],'index_0':['1','2'],'index_1':['2','3']}
df2 = pd.DataFrame(data=df2)
df2['new'] = df2['name'] + df2['index_0'] 
df2

for i, row in df1.iterrows():
    df2.loc[(df2['new'] == row['new']) , 'par_1'] = row['par_1']
df2 

Result:

    name index_0 index_1 new    par_1
0   A1   1       2       A11    a0
1   A1   2       3       A12    a1
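
The iterrows loop above still visits the rows one at a time. The same lookup can probably be done as a single left merge on the two key columns, which avoids both the Python-level loop and the concatenated string key. Below is a minimal sketch using the frames from the question; the helper name copy_params and the sample values are illustrative assumptions, not part of the original post, and it assumes each (name, Index) pair occurs at most once in df_data.

```python
import pandas as pd

# Sample frames shaped like the ones in the question (values are illustrative)
df_data = pd.DataFrame({
    "name": ["A1", "A1", "A1", "A1", "A2"],
    "Index": [1, 2, 3, 4, 2],
    "par_1": ["a0", "a1", "a2", "a3", "a4"],
})
df_main = pd.DataFrame({
    "name": ["A1"] * 6,
    "Index_0": [1, 1, 1, 2, 2, 3],
    "Index_1": [2, 3, 4, 3, 4, 4],
})


def copy_params(df_main, df_data, index_col):
    """Hypothetical helper: attach df_data's parameter columns to df_main.

    Rows are matched on (name, index_col) in df_main against
    (name, Index) in df_data.
    """
    merged = df_main.merge(
        df_data,
        how="left",                      # keep every row of df_main, in order
        left_on=["name", index_col],
        right_on=["name", "Index"],
        validate="many_to_one",          # fail if df_data has duplicate keys
    )
    # The right-hand key column is redundant after the merge
    return merged.drop(columns="Index")


result = copy_params(df_main, df_data, "Index_0")
print(result)
```

Rows of df_main with no match in df_data end up with NaN in the parameter columns. Calling the helper a second time with "Index_1" would collide on the parameter column names, so pandas would append suffixes such as _x and _y; passing the suffixes argument of merge is one way to control that.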