I have two DataFrames that I want to merge/group. Here they are:
df_1
words start stop
0 Oh, 6.72 7.21
1 okay, 7.26 8.01
2 go 12.82 12.90
3 ahead. 12.91 12.94
4 NaN 15.29 15.62
5 NaN 15.63 15.99
6 NaN 16.09 16.36
7 NaN 16.37 16.96
8 NaN 17.88 18.36
9 NaN 18.37 19.36
df_2
data start stop
10 1.0 3.5
14 4.0 8.5
11 9.0 13.5
12 14.0 20.5
I want to merge df_1.words into df_2, grouping all rows of df_1 whose df_1.start falls between df_2.start and df_2.stop. The result should look like this:
df_2
data start stop words
10 1.0 3.5 NaN
14 4.0 8.5 Oh, okay,
11 9.0 13.5 go ahead.
12 14.0 20.5 NaN, NaN, NaN, NaN, NaN, NaN
Answer 0 (score: 2)
If the two DataFrames are not too long, we can do a cross join:
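# Pair every row of df_2 with every row of df_1 via a constant dummy key
(df_2.assign(dummy=1)
     .merge(df_1.assign(dummy=1), on='dummy',
            how='left', suffixes=['', '_r'])
     # Keep only the pairs where df_1.start lies inside the df_2 interval
     .query('start <= start_r <= stop')
     .groupby(['data', 'start', 'stop'], as_index=False)
     .agg({'words': list})
)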
Output (the row for data 10 is missing because no df_1.start value falls inside its interval):
data start stop words
0 11 9.0 13.5 [go, ahead.]
1 12 14.0 20.5 [nan, nan, nan, nan, nan, nan]
2 14 4.0 8.5 [Oh,, okay,]
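The cross join above drops the df_2 row that has no match and returns lists rather than strings. As a rough sketch only (not the answerer's code), the variant below assumes pandas 1.2 or later, where merge accepts how='cross', and assumes that the data column uniquely identifies each row of df_2. The names pairs, inside, words and result are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df_1 = pd.DataFrame({
    "words": ["Oh,", "okay,", "go", "ahead."] + [np.nan] * 6,
    "start": [6.72, 7.26, 12.82, 12.91, 15.29, 15.63, 16.09, 16.37, 17.88, 18.37],
    "stop": [7.21, 8.01, 12.90, 12.94, 15.62, 15.99, 16.36, 16.96, 18.36, 19.36],
})
df_2 = pd.DataFrame({
    "data": [10, 14, 11, 12],
    "start": [1.0, 4.0, 9.0, 14.0],
    "stop": [3.5, 8.5, 13.5, 20.5],
})

# Every df_2 row paired with every df_1 row; df_1's start/stop get the "_r" suffix
pairs = df_2.merge(df_1, how="cross", suffixes=("", "_r"))

# Keep the pairs where df_1.start lies inside the df_2 interval (closed on both sides)
inside = pairs[(pairs["start"] <= pairs["start_r"]) & (pairs["start_r"] <= pairs["stop"])]

# Join the words of each interval into one string, writing missing words as "NaN"
words = inside.groupby("data")["words"].agg(lambda s: " ".join(s.fillna("NaN")))

# Map back onto df_2 so intervals without any match are kept (as NaN).
# Assumption: "data" is unique per df_2 row.
result = df_2.assign(words=df_2["data"].map(words))
print(result)
```

Mapping the aggregated strings back through data keeps all four rows of df_2, which is closer to the layout the question asks for. The cross join still builds len(df_1) * len(df_2) rows, so it is only reasonable for small inputs.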
Answer 1 (score: 1)
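If the bin edges do not overlap, as in your example, use pd.cut with an IntervalIndex to group the first DataFrame. This lets the intervals be closed on both sides. Then index the grouped result with the 'stop' column of df_2 to get the aggregated words.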
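import pandas as pd
# One interval per df_2 row, closed on both sides
idx = pd.Index([pd.Interval(*x, closed='both') for x in zip(df_2.start, df_2.stop)])
# observed=False keeps the empty bins, so every df_2 row gets an entry
s = df_1.groupby(pd.cut(df_1.start, idx), observed=False).words.agg(list)
# The intervals are closed on both sides, so each 'stop' value selects its own bin
df_2['words'] = s[df_2.stop].to_list()
print(df_2)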
data start stop words
0 10 1.0 3.5 []
1 14 4.0 8.5 [Oh,, okay,]
2 11 9.0 13.5 [go, ahead.]
3 12 14.0 20.5 [nan, nan, nan, nan, nan, nan]
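The same interval lookup can also be written without pd.cut. The sketch below is a variation on this answer, not the answerer's code: it relies on IntervalIndex.get_indexer, which for a non-overlapping IntervalIndex returns the position of the interval containing each value, or -1 when no interval contains it. The names pos, grouped and result are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df_1 = pd.DataFrame({
    "words": ["Oh,", "okay,", "go", "ahead."] + [np.nan] * 6,
    "start": [6.72, 7.26, 12.82, 12.91, 15.29, 15.63, 16.09, 16.37, 17.88, 18.37],
})
df_2 = pd.DataFrame({
    "data": [10, 14, 11, 12],
    "start": [1.0, 4.0, 9.0, 14.0],
    "stop": [3.5, 8.5, 13.5, 20.5],
})

# One interval per df_2 row, closed on both sides
idx = pd.IntervalIndex.from_arrays(df_2["start"], df_2["stop"], closed="both")

# Position of the df_2 interval containing each df_1.start (-1 means no interval)
pos = idx.get_indexer(df_1["start"])

# Collect the words of each interval position into a list
grouped = df_1["words"].groupby(pos).agg(list)

# Intervals that received no words get an empty list
lists = [grouped.get(i, []) for i in range(len(df_2))]
result = df_2.assign(words=pd.Series(lists, index=df_2.index))
print(result)
```

Because the lookup is by position, it does not depend on the 'stop' values being usable as keys, but it still requires the intervals not to overlap.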
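Answer 2 (score: 1)
Use:
# Bin df_1.start using all start/stop values of df_2 as consecutive edges
cut = pd.cut(df_1['start'], df_2[['start', 'stop']].stack())
# Join the words of each bin into one string (missing words become 'nan')
mapper = df_1.groupby(cut).words.agg(lambda x: ' '.join(x.astype(str)))
# Re-key each bin by its left edge so it can be matched against df_2.start
mapper.index = mapper.index.to_series().apply(lambda x: x.left)
df_2['words'] = df_2['start'].map(mapper)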
print(df_2)
data start stop words
0 10 1.0 3.5
1 14 4.0 8.5 Oh, okay,
2 11 9.0 13.5 go ahead.
3 12 14.0 20.5 nan nan nan nan nan nan