How to merge and group by across separate DataFrames

Time: 2019-12-09 21:09:15

Tags: python pandas

I have two DataFrames that I want to merge/group. They are shown below:

df_1


    words  start   stop
0     Oh,   6.72   7.21
1   okay,   7.26   8.01
2      go  12.82  12.90
3  ahead.  12.91  12.94
4     NaN  15.29  15.62
5     NaN  15.63  15.99
6     NaN  16.09  16.36
7     NaN  16.37  16.96
8     NaN  17.88  18.36
9     NaN  18.37  19.36

df_2

data  start  stop
10      1.0   3.5
14      4.0   8.5
11      9.0  13.5
12     14.0  20.5

I want to merge df_1.words into df_2, grouping together all rows of df_1 whose df_1.start lies between df_2.start and df_2.stop. The result should look like this:

df_2

data  start  stop  words
10      1.0   3.5  NaN
14      4.0   8.5  Oh, okay,
11      9.0  13.5  go ahead.
12     14.0  20.5  NaN, NaN, NaN, NaN, NaN, NaN
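
For anyone who wants to reproduce the frames above, here is a minimal construction sketch. It assumes that `data` is an ordinary column of df_2 (the answers treat it that way) and that the missing words are float NaN; neither is stated explicitly in the question.

```python
import numpy as np
import pandas as pd

# df_1: one row per word, with the word's start/stop time.
df_1 = pd.DataFrame({
    "words": ["Oh,", "okay,", "go", "ahead."] + [np.nan] * 6,
    "start": [6.72, 7.26, 12.82, 12.91, 15.29,
              15.63, 16.09, 16.37, 17.88, 18.37],
    "stop": [7.21, 8.01, 12.90, 12.94, 15.62,
             15.99, 16.36, 16.96, 18.36, 19.36],
})

# df_2: the time ranges that the words should be grouped into.
# Assumption: `data` is a regular column, not the index.
df_2 = pd.DataFrame({
    "data": [10, 14, 11, 12],
    "start": [1.0, 4.0, 9.0, 14.0],
    "stop": [3.5, 8.5, 13.5, 20.5],
})
```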

3 answers:

Answer 0: (score: 2)

If the two DataFrames are not too long, we can do a cross join:

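# Cross join via a constant key, then keep only the pairs where
# df_1's start falls inside df_2's [start, stop] range.
(df_2.assign(dummy=1)
     .merge(df_1.assign(dummy=1), on='dummy',
            how='left', suffixes=('', '_r'))
     .query('start <= start_r <= stop')
     .groupby(['data', 'start', 'stop'], as_index=False)
     .agg({'words': list})
)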

Output (the row with data 10 has no matching words, so the filter drops it):

   data  start  stop                           words
0    11    9.0  13.5                    [go, ahead.]
1    12   14.0  20.5  [nan, nan, nan, nan, nan, nan]
2    14    4.0   8.5                    [Oh,, okay,]
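
As a hedged variant of the cross join above: on pandas 1.2 or newer (an assumption about your environment), `merge(how="cross")` removes the need for a dummy key, and a final left merge back onto df_2 keeps the ranges that have no words. The names `pairs`, `matched`, `grouped` and `result` are only illustrative.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({
    "words": ["Oh,", "okay,", "go", "ahead."] + [np.nan] * 6,
    "start": [6.72, 7.26, 12.82, 12.91, 15.29,
              15.63, 16.09, 16.37, 17.88, 18.37],
    "stop": [7.21, 8.01, 12.90, 12.94, 15.62,
             15.99, 16.36, 16.96, 18.36, 19.36],
})
df_2 = pd.DataFrame({
    "data": [10, 14, 11, 12],
    "start": [1.0, 4.0, 9.0, 14.0],
    "stop": [3.5, 8.5, 13.5, 20.5],
})

# Every df_2 row paired with every df_1 row (requires pandas >= 1.2).
pairs = df_2.merge(df_1, how="cross", suffixes=("", "_r"))

# Keep pairs where the word's start lies inside the range (both ends inclusive).
matched = pairs[pairs["start_r"].between(pairs["start"], pairs["stop"])]

# Collect the words of each range into a list.
grouped = (matched.groupby(["data", "start", "stop"], as_index=False)
                  .agg(words=("words", list)))

# Left merge back so ranges without any word are kept (words is NaN there).
result = df_2.merge(grouped, on=["data", "start", "stop"], how="left")
print(result)
```

The cross join materialises len(df_1) * len(df_2) rows, so this stays practical only for small inputs, as the answer already notes.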

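Answer 1: (score: 1)

If your bin edges do not overlap, as in your example, use pd.cut with an IntervalIndex to group the first DataFrame. This lets the intervals be closed on both sides. Then select from the result using the 'stop' column of df_2 to get the aggregated values.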

import pandas as pd

idx = pd.Index([pd.Interval(*x, closed='both') for x in zip(df_2.start, df_2.stop)])

s = df_1.groupby(pd.cut(df_1.start, idx)).words.agg(list)

# Closed on both, can use `'stop'` to align
df_2['words'] = s[df_2.stop].to_list()

print(df_2)
   data  start  stop                           words
0    10    1.0   3.5                              []
1    14    4.0   8.5                    [Oh,, okay,]
2    11    9.0  13.5                    [go, ahead.]
3    12   14.0  20.5  [nan, nan, nan, nan, nan, nan]
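
A related sketch, not the answerer's exact code: the same closed-on-both-sides lookup can be written with `pd.IntervalIndex.from_arrays` and `IntervalIndex.get_indexer`, which returns the position of the containing interval for each value and -1 where there is none. It assumes the intervals do not overlap; the names `intervals`, `pos`, `matched` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({
    "words": ["Oh,", "okay,", "go", "ahead."] + [np.nan] * 6,
    "start": [6.72, 7.26, 12.82, 12.91, 15.29,
              15.63, 16.09, 16.37, 17.88, 18.37],
})
df_2 = pd.DataFrame({
    "data": [10, 14, 11, 12],
    "start": [1.0, 4.0, 9.0, 14.0],
    "stop": [3.5, 8.5, 13.5, 20.5],
})

# One interval per df_2 row, closed on both ends.
intervals = pd.IntervalIndex.from_arrays(df_2["start"], df_2["stop"],
                                         closed="both")

# Row position in df_2 of the interval containing each df_1.start (-1 if none).
pos = intervals.get_indexer(df_1["start"].to_numpy())

# Drop unmatched rows, then collect the words per interval position.
matched = df_1.assign(pos=pos)
matched = matched[matched["pos"] >= 0]
words = matched.groupby("pos")["words"].agg(list)

# Align by position; intervals without any word get NaN.
result = df_2.copy()
result["words"] = words.reindex(range(len(df_2))).to_numpy()
print(result)
```

Unlike the output above, a range with no words ends up as NaN here rather than an empty list.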

Answer 2: (score: 1)

Use:

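# Bin df_1.start on the interleaved start/stop edges of df_2
cut = pd.cut(df_1['start'], df_2[['start', 'stop']].stack())
# Join the words of each bin into one string (NaN becomes 'nan')
mapper = df_1.groupby(cut).words.agg(lambda x: ' '.join(x.astype(str)))
# Re-key each bin by its left edge so it lines up with df_2.start
mapper.index = mapper.index.to_series().apply(lambda x: x.left)
df_2['words'] = df_2['start'].map(mapper)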

print(df_2)

   data  start  stop                    words
0    10    1.0   3.5                         
1    14    4.0   8.5                Oh, okay,
2    11    9.0  13.5                go ahead.
3    12   14.0  20.5  nan nan nan nan nan nan
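
To make the binning step in this answer easier to follow, here is a small sketch of what the stacked edges look like and how `pd.cut` treats them. With the default `right=True` the bins are open on the left and closed on the right, so a value exactly equal to a df_2.start would land in the preceding gap bin. The three probe values are made up for illustration.

```python
import pandas as pd

df_2 = pd.DataFrame({
    "data": [10, 14, 11, 12],
    "start": [1.0, 4.0, 9.0, 14.0],
    "stop": [3.5, 8.5, 13.5, 20.5],
})

# stack() interleaves start and stop row by row, giving increasing edges.
edges = df_2[["start", "stop"]].stack().to_numpy()
print(edges)

# Hypothetical probe values: on an edge, inside a range, and inside a gap.
probe = pd.Series([4.0, 6.72, 8.7])
cats = pd.cut(probe, edges)

# Left edge of the bin each probe value fell into.
lefts = [iv.left for iv in cats]
print(lefts)
```

Only bins whose left edge equals a df_2.start are picked up by the final `map`, which is why words falling into the gap bins between ranges are ignored.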