I have two dataframes that I want to merge. Both span 300 seconds (per the 'start' column). They are shown below.
df_1:
color start stop
0 blue 2.72 2.85
1 green 2.86 3.09
2 blue 3.10 3.47
3 green 3.48 4.69
4 blue 4.70 5.97
5 green 5.98 7.07
df_2:
confidence start
0 .11 2.79
1 .78 2.99
2 .65 3.04
3 .22 3.43
4 .54 3.61
5 .99 3.99
6 .52 4.24
7 .63 4.31
8 .71 4.67
9 .82 4.85
10 .81 5.09
11 .33 5.26
12 .31 5.69
13 .44 5.99
14 .55 6.22
15 .81 6.43
16 .31 6.93
17 .32 7.01
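…and so on
I want to merge in the aggregated mean of df_2['confidence'] wherever a df_2['start'] value falls between the df_1['start'] and df_1['stop'] values.
Ideally, the result would look like this: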
color start stop confidence
0 blue 2.72 2.85 .11
1 green 2.86 3.09 .72
2 blue 3.10 3.47 .22
3 green 3.48 4.69 .68
4 blue 4.70 5.97 .57
5 green 5.98 7.07 .49
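Thanks!
Answer 0 (score: 3)
You can use IntervalIndex to build an interval tree, then use IntervalIndex.get_indexer to get the position of the interval that each df2['start'] value falls in, and finally group by those positions and take the mean: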
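import pandas as pd
# one interval per row of df1; get_indexer returns -1 for values inside no interval
idx = pd.IntervalIndex.from_arrays(df1['start'], df1['stop'])
df1.join(
    df2.groupby(idx.get_indexer(df2['start']))['confidence'].mean(), how='left')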
color start stop confidence
0 blue 2.72 2.85 0.1100
1 green 2.86 3.09 0.7150
2 blue 3.10 3.47 0.2200
3 green 3.48 4.69 0.6780
4 blue 4.70 5.97 0.5675
5 green 5.98 7.07 0.4860
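For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same IntervalIndex idea. The frames are rebuilt from the rows shown in the question (only the rows listed, not the elided ones). Two things are my own assumptions rather than part of the answer: closed="both", so that a value exactly equal to a start or stop still matches, and the explicit filter that drops positions of -1 before grouping. It also assumes df1 has the default 0..n-1 index, since the join lines up on interval position.

```python
import pandas as pd

# Sample rows copied from the question
df1 = pd.DataFrame({
    "color": ["blue", "green", "blue", "green", "blue", "green"],
    "start": [2.72, 2.86, 3.10, 3.48, 4.70, 5.98],
    "stop": [2.85, 3.09, 3.47, 4.69, 5.97, 7.07],
})
df2 = pd.DataFrame({
    "confidence": [.11, .78, .65, .22, .54, .99, .52, .63, .71,
                   .82, .81, .33, .31, .44, .55, .81, .31, .32],
    "start": [2.79, 2.99, 3.04, 3.43, 3.61, 3.99, 4.24, 4.31, 4.67,
              4.85, 5.09, 5.26, 5.69, 5.99, 6.22, 6.43, 6.93, 7.01],
})

# One interval per df1 row; closed="both" is an assumption so that
# values equal to a start or stop are still matched
idx = pd.IntervalIndex.from_arrays(df1["start"], df1["stop"], closed="both")

# Position of the containing interval for each df2 start, -1 if none
pos = idx.get_indexer(df2["start"])

# Drop unmatched rows, then average confidence per interval position
matched = pos >= 0
means = df2.loc[matched, "confidence"].groupby(pos[matched]).mean()

# Positions line up with df1's default RangeIndex, so a left join works
result = df1.join(means, how="left")
print(result)
```

Intervals of df1 that contain no df2 value end up with NaN in the confidence column because of the left join.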
Answer 1 (score: 3)
IIUC, you can use pd.cut and groupby, and then merge:
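# bin edges for cut: the first start followed by every stop
bins = [df1.start[0]] + df1.stop.to_list()
# label each df2 start with the start of the df1 interval it falls in
s = pd.cut(df2.start, bins=bins, labels=df1.start)
# group df2 by those labels and average the confidence
new_df = df2.groupby(s).confidence.mean()
# merge the means back onto df1
df1.merge(new_df, left_on='start', right_index=True)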
This gives:
color start stop confidence
0 blue 2.72 2.85 0.110000
1 green 2.85 3.09 0.715000
2 blue 3.09 3.47 0.220000
3 green 8.43 8.69 0.577857
4 blue 8.69 8.97 NaN
5 green 8.97 9.07 NaN
With the edited df1 (it makes sense that this now matches the expected output):
color start stop confidence
0 blue 2.72 2.85 0.1100
1 green 2.86 3.09 0.7150
2 blue 3.1 3.47 0.2200
3 green 3.48 4.69 0.6780
4 blue 4.7 5.97 0.5675
5 green 5.98 7.07 0.4860
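A runnable sketch of the pd.cut variant, again on the rows listed in the question. It differs from the snippet above in a few ways that are my own choices: labels=False returns integer bin positions instead of using the start values as labels, include_lowest=True keeps a value equal to the very first start, and the result is joined on df1's index (assumed to be the default 0..n-1) rather than merged on 'start'. Note that because the bin edges are built from the stops alone, a df2 value falling in the small gap between one stop and the next start is counted toward the following interval.

```python
import pandas as pd

# Sample rows copied from the question
df1 = pd.DataFrame({
    "color": ["blue", "green", "blue", "green", "blue", "green"],
    "start": [2.72, 2.86, 3.10, 3.48, 4.70, 5.98],
    "stop": [2.85, 3.09, 3.47, 4.69, 5.97, 7.07],
})
df2 = pd.DataFrame({
    "confidence": [.11, .78, .65, .22, .54, .99, .52, .63, .71,
                   .82, .81, .33, .31, .44, .55, .81, .31, .32],
    "start": [2.79, 2.99, 3.04, 3.43, 3.61, 3.99, 4.24, 4.31, 4.67,
              4.85, 5.09, 5.26, 5.69, 5.99, 6.22, 6.43, 6.93, 7.01],
})

# Bin edges: the first start followed by every stop
bins = [df1["start"].iloc[0]] + df1["stop"].tolist()

# labels=False yields the integer position of the bin for each value
# (NaN for values outside all bins)
codes = pd.cut(df2["start"], bins=bins, labels=False, include_lowest=True)

# groupby drops NaN keys by default, so out-of-range rows are ignored
means = df2.groupby(codes)["confidence"].mean()
means.index = means.index.astype(int)

# Bin positions line up with df1's default RangeIndex
result = df1.join(means, how="left")
print(result)
```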