I have 4 dataframes, as shown below:
DF1
_id bs ds as pf
0 2017-05-01 00:00:00 0.982218 0.906662 0.614119 0.999471
1 2017-05-01 00:05:00 0.983751 0.913266 0.585237 0.999571
2 2017-05-01 00:10:00 0.983012 0.914875 0.592698 0.999631
3 2017-05-01 00:15:00 0.981884 0.910922 0.589013 0.999536
4 2017-05-01 00:20:00 0.982611 0.913082 0.601056 0.999556
5 2017-05-01 00:25:00 0.982386 0.912358 0.598856 0.999650
DF2
_id avg_time_serve
0 2017-05-01 00:00:00 0.520681
1 2017-05-01 00:05:00 0.521580
2 2017-05-01 00:10:00 0.517993
3 2017-05-01 00:15:00 0.520662
4 2017-05-01 00:20:00 0.514146
5 2017-05-01 00:25:00 0.513723
DF3
_id total_distinct_ips
0 2017-05-01 00:00:00 291094.0
1 2017-05-01 00:05:00 287922.0
2 2017-05-01 00:10:00 292103.0
3 2017-05-01 00:15:00 295675.0
4 2017-05-01 00:20:00 297813.0
5 2017-05-01 00:25:00 302406.0
DF4
_id total_40x total_50x
0 2017-05-01 00:00:00 162034 0
1 2017-05-01 00:05:00 162497 0
2 2017-05-01 00:10:00 161079 0
3 2017-05-01 00:15:00 163338 0
4 2017-05-01 00:20:00 167901 0
5 2017-05-01 00:25:00 164394 0
I am trying to merge them on the '_id' column, which is in timestamp format.
I tried the following approaches:
**Approach 1**
import pandas as pd
from functools import reduce

dfs = [df1, df2, df3, df4]
final_df = reduce(lambda left, right: pd.merge(left, right, on='_id',
                                               how='outer'), dfs)
**Approach 2**
final_df = pd.DataFrame()
for df in dfs:
    # start with the first frame, then outer-merge each remaining frame on '_id'
    if final_df.empty:
        final_df = df
    else:
        final_df = pd.merge(final_df, df, how='outer', on='_id')
Both approaches give the following result:
_id bs ds as pf \
0 2017-05-01 00:00:00 0.982218 0.906662 0.614119 0.999471
1 2017-05-01 00:00:00 NaN NaN NaN NaN
2 2017-05-01 00:05:00 0.983751 0.913266 0.585237 0.999571
3 2017-05-01 00:05:00 NaN NaN NaN NaN
4 2017-05-01 00:10:00 0.983012 0.914875 0.592698 0.999631
5 2017-05-01 00:10:00 NaN NaN NaN NaN
avg_time_serve total_distinct_ips total_40x total_50x
0 NaN 291094.0 162034 0
1 0.520681 291094.0 162034 0
2 NaN 287922.0 162497 0
3 0.521580 287922.0 162497 0
4 NaN 292103.0 161079 0
5 0.517993 292103.0 161079 0
**Approach 3**
I took 'df1' out of the dfs list and added a 'join' instead.
from functools import reduce

dfs = [df2, df3, df4]
final_df = reduce(lambda left, right: pd.merge(left, right, on='_id',
                                               how='outer'), dfs)
final_df = final_df.join(df1.set_index('_id'), on='_id')
This finally gave the correct result:
_id avg_time_serve total_distinct_ips total_40x
0 2017-05-01 00:00:00 0.520681 291094.0 162034
1 2017-05-01 00:05:00 0.521580 287922.0 162497
2 2017-05-01 00:10:00 0.517993 292103.0 161079
3 2017-05-01 00:15:00 0.520662 295675.0 163338
4 2017-05-01 00:20:00 0.514146 297813.0 167901
5 2017-05-01 00:25:00 0.513723 302406.0 164394
total_50x bs ds as pf
0 0 0.982218 0.906662 0.614119 0.999471
1 0 0.983751 0.913266 0.585237 0.999571
2 0 0.983012 0.914875 0.592698 0.999631
3 0 0.981884 0.910922 0.589013 0.999536
4 0 0.982611 0.913082 0.601056 0.999556
5 0 0.982386 0.912358 0.598856 0.999650
Questions:
Shouldn't approaches #1 and #2 work for any number of dataframes being merged together?
Why do approaches 1 and 2 create duplicate '_id' rows and insert NaN values?
**Answer** (score: 0)
You can also use pd.concat with set_index:

pd.concat([df1.set_index('_id'), df2.set_index('_id'), df3.set_index('_id'),
           df4.set_index('_id')], axis=1).reset_index()
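A minimal, self-contained sketch of that pd.concat approach, assuming the four frames share the '_id' timestamps from the question (the column names mirror the question's frames; the two sample rows are illustrative, not the real data):

import pandas as pd

# Two illustrative rows per frame, keyed by the same '_id' timestamps.
ts = pd.to_datetime(['2017-05-01 00:00:00', '2017-05-01 00:05:00'])
df1 = pd.DataFrame({'_id': ts, 'bs': [0.982218, 0.983751]})
df2 = pd.DataFrame({'_id': ts, 'avg_time_serve': [0.520681, 0.521580]})
df3 = pd.DataFrame({'_id': ts, 'total_distinct_ips': [291094.0, 287922.0]})
df4 = pd.DataFrame({'_id': ts, 'total_40x': [162034, 162497], 'total_50x': [0, 0]})

# Move '_id' onto the index of each frame, then lay their columns side by side;
# axis=1 aligns rows by index, so each timestamp appears exactly once.
final_df = pd.concat([d.set_index('_id') for d in (df1, df2, df3, df4)],
                     axis=1).reset_index()
print(final_df)

Note that both concat and merge only line rows up when the '_id' values compare equal, so if the original four frames store '_id' with differing dtypes (for example strings in one and datetime64 in another), normalizing them first with pd.to_datetime is one likely fix for the duplicated '_id' rows with NaN gaps shown in the question.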