Python, Pandas: merging multiple DataFrames produces duplicate rows with unevenly distributed NaN values

Time: 2017-06-07 21:23:07

Tags: python pandas

I have 4 DataFrames, shown below.

DF1

                     _id        bs        ds        as        pf
0    2017-05-01 00:00:00  0.982218  0.906662  0.614119  0.999471
1    2017-05-01 00:05:00  0.983751  0.913266  0.585237  0.999571
2    2017-05-01 00:10:00  0.983012  0.914875  0.592698  0.999631
3    2017-05-01 00:15:00  0.981884  0.910922  0.589013  0.999536
4    2017-05-01 00:20:00  0.982611  0.913082  0.601056  0.999556
5    2017-05-01 00:25:00  0.982386  0.912358  0.598856  0.999650

DF2

                    _id  avg_time_serve  
0   2017-05-01 00:00:00        0.520681            
1   2017-05-01 00:05:00        0.521580            
2   2017-05-01 00:10:00        0.517993            
3   2017-05-01 00:15:00        0.520662            
4   2017-05-01 00:20:00        0.514146            
5   2017-05-01 00:25:00        0.513723            

DF3

                    _id   total_distinct_ips    
0   2017-05-01 00:00:00             291094.0     
1   2017-05-01 00:05:00             287922.0     
2   2017-05-01 00:10:00             292103.0     
3   2017-05-01 00:15:00             295675.0     
4   2017-05-01 00:20:00             297813.0     
5   2017-05-01 00:25:00             302406.0     

DF4

                    _id  total_40x  total_50x
0   2017-05-01 00:00:00     162034          0
1   2017-05-01 00:05:00     162497          0
2   2017-05-01 00:10:00     161079          0
3   2017-05-01 00:15:00     163338          0
4   2017-05-01 00:20:00     167901          0
5   2017-05-01 00:25:00     164394          0

I am trying to merge them on the '_id' column. The '_id' column is in timestamp format.

I tried the following approaches:

**Approach 1**

from functools import reduce

import pandas as pd

dfs = [df1, df2, df3, df4]
# Fold the list into one frame with successive outer merges on '_id'.
final_df = reduce(
    lambda left, right: pd.merge(left, right, on='_id', how='outer'),
    dfs,
)

**Approach 2**

final_df = pd.DataFrame()

for df in dfs:
    if final_df.empty:
        final_df = df
    else:
        final_df = pd.merge(final_df, df, how='outer', on='_id')

Both approaches gave the following result:

                    _id        bs        ds        as        pf  \
0   2017-05-01 00:00:00  0.982218  0.906662  0.614119  0.999471
1   2017-05-01 00:00:00       NaN       NaN       NaN       NaN
2   2017-05-01 00:05:00  0.983751  0.913266  0.585237  0.999571
3   2017-05-01 00:05:00       NaN       NaN       NaN       NaN
4   2017-05-01 00:10:00  0.983012  0.914875  0.592698  0.999631
5   2017-05-01 00:10:00       NaN       NaN       NaN       NaN

     avg_time_serve  total_distinct_ips  total_40x  total_50x
0               NaN            291094.0     162034          0
1          0.520681            291094.0     162034          0
2               NaN            287922.0     162497          0
3          0.521580            287922.0     162497          0
4               NaN            292103.0     161079          0
5          0.517993            292103.0     161079          0

**Approach 3**

I took 'df1' out of the dfs list, merged the remaining frames, and then added 'df1' with a 'join'.

from functools import reduce

dfs = [df2, df3, df4]
final_df = reduce(lambda left,right: pd.merge(left, right, on='_id', 
           how='outer'), dfs)
final_df = final_df.join(df1.set_index('_id'), on='_id')

This finally gave the correct result:

                    _id  avg_time_serve  total_distinct_ips  total_40x 
0   2017-05-01 00:00:00        0.520681            291094.0     162034
1   2017-05-01 00:05:00        0.521580            287922.0     162497
2   2017-05-01 00:10:00        0.517993            292103.0     161079
3   2017-05-01 00:15:00        0.520662            295675.0     163338
4   2017-05-01 00:20:00        0.514146            297813.0     167901
5   2017-05-01 00:25:00        0.513723            302406.0     164394

     total_50x        bs        ds        as        pf
0            0  0.982218  0.906662  0.614119  0.999471
1            0  0.983751  0.913266  0.585237  0.999571
2            0  0.983012  0.914875  0.592698  0.999631
3            0  0.981884  0.910922  0.589013  0.999536
4            0  0.982611  0.913082  0.601056  0.999556
5            0  0.982386  0.912358  0.598856  0.999650

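Questions:

  1. Shouldn't approaches 1 and 2 work for any number of DataFrames merged together?

  2. Why do approaches 1 and 2 create duplicate '_id' values and insert NaN values?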
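
The post does not show the dtypes of the four '_id' columns, so the cause of the duplicated rows cannot be confirmed from what is given. One thing worth checking, as a sketch only, is whether every frame stores '_id' in the same dtype and precision, because an outer merge keeps keys that do not compare equal as separate rows. The frames `left` and `right` below are hypothetical stand-ins, and the string-typed key in `right` is an assumption made purely for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for two of the frames in the question.
# Assumption: one frame holds '_id' as real timestamps, the other as strings.
left = pd.DataFrame({
    "_id": pd.to_datetime(["2017-05-01 00:00:00", "2017-05-01 00:05:00"]),
    "bs": [0.982218, 0.983751],
})
right = pd.DataFrame({
    "_id": ["2017-05-01 00:00:00", "2017-05-01 00:05:00"],
    "avg_time_serve": [0.520681, 0.521580],
})
frames = {"left": left, "right": right}

# Step 1: report whether each '_id' column is really a datetime column.
is_datetime_before = {
    name: pd.api.types.is_datetime64_any_dtype(df["_id"])
    for name, df in frames.items()
}
print(is_datetime_before)

# Step 2: coerce every key to datetime and confirm it is unique per frame.
for name, df in frames.items():
    df["_id"] = pd.to_datetime(df["_id"])
    assert df["_id"].is_unique, f"duplicate '_id' values in {name}"

# Step 3: merge only after the keys share one dtype.
merged = pd.merge(left, right, on="_id", how="outer")
print(merged)
```

If the dtypes already agree, differences hidden by the printed format (for example sub-second components) would be the next thing to rule out.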

1 answer:

Answer 0 (score: 0)

You can also use pd.concat together with set_index:

pd.concat([df1.set_index('_id'), df2.set_index('_id'), df3.set_index('_id'), df4.set_index('_id')], axis=1).reset_index()
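
To make the suggestion easy to try, here is a minimal runnable sketch of the same idea. It rebuilds the first three rows of each frame from the question, with fewer columns in `df1` for brevity, and assumes that '_id' is a proper datetime column with unique values in every frame.

```python
import pandas as pd

# First three timestamps from the question; assumed identical across frames.
ids = pd.to_datetime([
    "2017-05-01 00:00:00",
    "2017-05-01 00:05:00",
    "2017-05-01 00:10:00",
])

df1 = pd.DataFrame({"_id": ids, "bs": [0.982218, 0.983751, 0.983012]})
df2 = pd.DataFrame({"_id": ids, "avg_time_serve": [0.520681, 0.521580, 0.517993]})
df3 = pd.DataFrame({"_id": ids, "total_distinct_ips": [291094.0, 287922.0, 292103.0]})
df4 = pd.DataFrame({"_id": ids, "total_40x": [162034, 162497, 161079],
                    "total_50x": [0, 0, 0]})

dfs = [df1, df2, df3, df4]

# Move '_id' into the index so that concat aligns rows by timestamp,
# then turn the index back into an ordinary column.
final_df = pd.concat([df.set_index("_id") for df in dfs], axis=1).reset_index()
print(final_df)
```

Because `axis=1` aligns on the index, this relies on the keys being comparable across frames in the same way a merge does, so it is an alternative way to write the combination rather than a fix for mismatched keys.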