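Merging pandas DataFrames into a larger one, copying values according to a rule

Time: 2019-05-06 05:46:50

Tags: python pandas

There are two DataFrames whose datetime values increase in increments of either 5 minutes (df_05min) or 15 minutes (df_15min).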

import pandas as pd

df_05min = pd.DataFrame({'dt':['2008-10-2404:12:30',
                                '2008-10-2404:12:35',
                                '2008-10-2404:12:40',
                                '2008-10-2404:12:45',
                                '2008-10-2404:12:50',
                                '2008-10-2404:13:00',
                                '2008-10-2404:13:05']})

df_15min = pd.DataFrame([['2008-10-2404:12:15',  'L'],
                        ['2008-10-2404:12:30',  'r'],
                        ['2008-10-2404:12:45',  'S'  ],
                        ['2008-10-2404:13:00',  'L'],
                        ['2008-10-2404:13:15',  'L' ]], columns=['dt','col'])

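The goal is to merge the df_15min DataFrame into the df_05min DataFrame on the datetime column dt, and to copy the accompanying data into the appropriate rows. This is an alternative to an outer merge, in which non-matching values would get np.nan. For example, in df_15min, '2008-10-2404:12:30' has the value 'r', which I would like to copy to the rows of df_05min that belong to that interval. This means that 12:30, 12:35 and 12:40 should all have the value 'r'.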
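To make the problem concrete, here is a minimal sketch of what a plain outer merge does. It uses a shortened, hypothetical subset of the data, with the timestamps written with a space between date and time (an assumption; the strings above omit it). The 5-minute rows that have no exact match in the 15-minute frame end up with NaN in col:

```python
import pandas as pd

# Shortened, hypothetical subset of the two frames
df_05min = pd.DataFrame({'dt': ['2008-10-24 04:12:30',
                                '2008-10-24 04:12:35',
                                '2008-10-24 04:12:40',
                                '2008-10-24 04:12:45']})
df_15min = pd.DataFrame({'dt': ['2008-10-24 04:12:30',
                                '2008-10-24 04:12:45'],
                         'col': ['r', 'S']})

# A plain outer merge only fills col where dt matches exactly
outer = df_05min.merge(df_15min, on='dt', how='outer')

# Count the rows that were left without a value
n_missing = outer['col'].isna().sum()
print(outer)
print(n_missing)
```

Here the 12:35 and 12:40 rows are the ones left without a value, which is exactly what I want to avoid.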

The desired final product is as follows:

[output table missing in the source]

2 Answers:

Answer 0: (score: 1)

Try using merge with how='outer', combined with a forward fill (ffill) and sort_values:

print(df_05min.merge(df_15min,how='outer').ffill().sort_values('dt'))

Output:

                   dt col
7  2008-10-2404:12:15   L
0  2008-10-2404:12:30   r
1  2008-10-2404:12:35   r
2  2008-10-2404:12:40   r
3  2008-10-2404:12:45   S
4  2008-10-2404:12:50   S
5  2008-10-2404:13:00   L
6  2008-10-2404:13:05   L
8  2008-10-2404:13:15   L

If you care about the index, use:

print(df_05min.merge(df_15min,how='outer').ffill().sort_values('dt').reset_index(drop=True))
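One caveat about the order of operations: ffill fills from the previous row in whatever order the rows currently have, so calling it before sort_values only gives the intended result when the rows are already chronological. The following is a minimal sketch with a small hypothetical frame (not the data from the question) whose rows are deliberately out of order, comparing the two orderings:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the 12:40 row placed after the 12:45 row
df = pd.DataFrame({
    'dt': pd.to_datetime(['2008-10-24 04:12:30',
                          '2008-10-24 04:12:35',
                          '2008-10-24 04:12:45',
                          '2008-10-24 04:12:40']),
    'col': ['r', np.nan, 'S', np.nan],
})

# Fill first, then sort: the 12:40 row inherits 'S' from the 12:45 row above it
fill_then_sort = df.ffill().sort_values('dt').reset_index(drop=True)

# Sort first, then fill: the 12:40 row inherits 'r' from the 12:35 row
sort_then_fill = df.sort_values('dt').ffill().reset_index(drop=True)

print(fill_then_sort)
print(sort_then_fill)
```

If the row order of the merged frame is not guaranteed, sorting before filling is the safer choice.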

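Answer 1: (score: 1)

What is needed here is merge_asof with an outer join, but that is not implemented yet, so a possible solution is DataFrame.merge, followed by sorting with DataFrame.sort_values, forward filling the missing values, and finally creating a default index with DataFrame.reset_index: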

df_05min = pd.DataFrame({'dt':['2008-10-24 04:12:30',
                                '2008-10-24 04:12:35',
                                '2008-10-24 04:12:40',
                                '2008-10-24 04:12:45',
                                '2008-10-24 04:12:50',
                                '2008-10-24 04:13:00',
                                '2008-10-24 04:13:05']})

df_15min = pd.DataFrame([['2008-10-24 04:12:15',  'L'],
                        ['2008-10-24 04:12:30',  'r'],
                        ['2008-10-24 04:12:45',  'S'  ],
                        ['2008-10-24 04:13:00',  'L'],
                        ['2008-10-24 04:13:15',  'L' ]], columns=['dt','col'])

df_05min['dt'] = pd.to_datetime(df_05min['dt'])
df_15min['dt'] = pd.to_datetime(df_15min['dt'])

df=pd.merge(df_05min, df_15min, how='outer').sort_values('dt').ffill().reset_index(drop=True)
print (df)
                   dt col
0 2008-10-24 04:12:15   L
1 2008-10-24 04:12:30   r
2 2008-10-24 04:12:35   r
3 2008-10-24 04:12:40   r
4 2008-10-24 04:12:45   S
5 2008-10-24 04:12:50   S
6 2008-10-24 04:13:00   L
7 2008-10-24 04:13:05   L
8 2008-10-24 04:13:15   L
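
For comparison, if only the rows of df_05min are wanted (without the extra 12:15 and 13:15 rows that come from df_15min), pd.merge_asof on its own may be enough. This is a minimal sketch under that assumption; it also requires both frames to be sorted by dt and the dt columns to be real datetimes:

```python
import pandas as pd

df_05min = pd.DataFrame({'dt': pd.to_datetime(['2008-10-24 04:12:30',
                                               '2008-10-24 04:12:35',
                                               '2008-10-24 04:12:40',
                                               '2008-10-24 04:12:45',
                                               '2008-10-24 04:12:50',
                                               '2008-10-24 04:13:00',
                                               '2008-10-24 04:13:05'])})

df_15min = pd.DataFrame({'dt': pd.to_datetime(['2008-10-24 04:12:15',
                                               '2008-10-24 04:12:30',
                                               '2008-10-24 04:12:45',
                                               '2008-10-24 04:13:00',
                                               '2008-10-24 04:13:15']),
                         'col': ['L', 'r', 'S', 'L', 'L']})

# For each row of df_05min, take col from the latest df_15min row
# whose dt is at or before it (direction='backward' is the default)
left_only = pd.merge_asof(df_05min, df_15min, on='dt', direction='backward')
print(left_only)
```

The result keeps the seven rows of df_05min and never contains rows that exist only in df_15min, which is why the merge, sort and ffill approach above is used when those rows are needed too.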