Python DataFrame: merge two DataFrames based on a condition (pandas)

Time: 2020-06-02 19:22:02

Tags: python pandas

Suppose I have two DataFrames:

DATAFRAME 1
    onset  offset
0       1     200
1     201     400
2     401     600
3     601     800
4     801    1000
5    1001    1200
6    1201    1400
7    1401    1600
8    1601    1800
9    1801    2000
10   2001    2200
11   2201    2400
12   2401    2600
13   2601    2800
14   2801    3000
15   3001    3200
16   3201    3400
17   3401    3600
18   3601    3800
19   3801    4000
20   4001    4200
21   4201    4400
22   4401    4600
23   4601    4800
24   4801    5000
25   5001    5200
26   5201    5400
27   5401    5600
28   5601    5800
29   5801    6000
DATAFRAME 2
   onset rhythm_name  rhythm_code  offset
0      1         NSR          100    2760
1   2761  JUNCTIONAL         4000    3938
2   3939         NSR          100    6000

My goal is to merge the two DataFrames on their onset-offset intervals and add the matching rhythm_name and rhythm_code to each row, to get something like this:

    onset  offset  rhythm_name  rhythm_code
0       1     200        NSR          100 
1     201     400        NSR          100
2     401     600        NSR          100
3     601     800        NSR          100
4     801    1000        NSR          100
5    1001    1200        NSR          100
6    1201    1400        NSR          100
7    1401    1600        NSR          100
8    1601    1800        NSR          100
9    1801    2000        NSR          100
10   2001    2200        NSR          100
11   2201    2400        NSR          100
12   2401    2600        NSR          100
13   2601    2800        Null         Null
14   2801    3000  JUNCTIONAL         4000
15   3001    3200  JUNCTIONAL         4000
16   3201    3400  JUNCTIONAL         4000
17   3401    3600  JUNCTIONAL         4000
18   3601    3800  JUNCTIONAL         4000
19   3801    4000        Null         Null
20   4001    4200        NSR          100
21   4201    4400        NSR          100
22   4401    4600        NSR          100
23   4601    4800        NSR          100
24   4801    5000        NSR          100
25   5001    5200        NSR          100
26   5201    5400        NSR          100
27   5401    5600        NSR          100
28   5601    5800        NSR          100
29   5801    6000        NSR          100

How can I do this? I could not find a way to solve it. I have already tried something like:

df1["rhythm_name"] = df2[(df1['onset'] >= df2['onset']) & (df1['offset'] <= df2['offset'])]

But I get:

ValueError: Can only compare identically-labeled Series objects

I made a script to reproduce the problem:

import pandas as pd

df1 = pd.DataFrame()
onsets = []
for i in range(0,30):
  onset = i * 200 + 1
  onsets.append(onset)
df1['onset'] = onsets
df1['offset'] = df1["onset"]+200-1

# Wrap the dict in a DataFrame so that it matches DATAFRAME 2 above
df2 = pd.DataFrame({'onset': [1, 2761, 3939],
                    'offset': [2760, 3938, 6000],
                    'rhythm_name': ["NSR", "JUNCTIONAL", "NSR"],
                    'rhythm_code': [100, 4000, 100]})

2 answers:

Answer 0: (score: 4)

You can use pd.merge_asof and then mask on the second condition:

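# Match each df1 row to the last df2 row whose onset is <= the df1 onset
dfm = pd.merge_asof(df1, df2, on='onset', direction='backward', suffixes=('','_y'))
# Keep the match only where the df1 interval also ends inside the df2 interval
dfm[['rhythm_name', 'rhythm_code']] = (dfm[['rhythm_name', 'rhythm_code']]
                                          .where(dfm['offset'] <= dfm['offset_y']))
dfm.drop('offset_y', axis=1)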

Output:

    onset  offset rhythm_name  rhythm_code
0       1     200         NSR        100.0
1     201     400         NSR        100.0
2     401     600         NSR        100.0
3     601     800         NSR        100.0
4     801    1000         NSR        100.0
5    1001    1200         NSR        100.0
6    1201    1400         NSR        100.0
7    1401    1600         NSR        100.0
8    1601    1800         NSR        100.0
9    1801    2000         NSR        100.0
10   2001    2200         NSR        100.0
11   2201    2400         NSR        100.0
12   2401    2600         NSR        100.0
13   2601    2800         NaN          NaN
14   2801    3000  JUNCTIONAL       4000.0
15   3001    3200  JUNCTIONAL       4000.0
16   3201    3400  JUNCTIONAL       4000.0
17   3401    3600  JUNCTIONAL       4000.0
18   3601    3800  JUNCTIONAL       4000.0
19   3801    4000         NaN          NaN
20   4001    4200         NSR        100.0
21   4201    4400         NSR        100.0
22   4401    4600         NSR        100.0
23   4601    4800         NSR        100.0
24   4801    5000         NSR        100.0
25   5001    5200         NSR        100.0
26   5201    5400         NSR        100.0
27   5401    5600         NSR        100.0
28   5601    5800         NSR        100.0
29   5801    6000         NSR        100.0
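
To see why the mask on the second condition is needed, here is a minimal sketch that rebuilds the question's data and lists the rows where merge_asof finds a match by onset even though the interval runs past the matched offset. The names straddling, cols and result are illustrative and not part of the original answer, and the data is assumed to be exactly the two frames from the question.

```python
import pandas as pd

# Rebuild the question's data: 30 intervals of 200 samples, 3 rhythm segments.
df1 = pd.DataFrame({'onset': [i * 200 + 1 for i in range(30)]})
df1['offset'] = df1['onset'] + 199
df2 = pd.DataFrame({'onset': [1, 2761, 3939],
                    'offset': [2760, 3938, 6000],
                    'rhythm_name': ['NSR', 'JUNCTIONAL', 'NSR'],
                    'rhythm_code': [100, 4000, 100]})

# merge_asof requires both frames to be sorted by the key column.
dfm = pd.merge_asof(df1.sort_values('onset'), df2.sort_values('onset'),
                    on='onset', direction='backward', suffixes=('', '_y'))

# Rows matched by onset whose interval ends after the matched segment ends.
straddling = dfm.loc[dfm['offset'] > dfm['offset_y'],
                     ['onset', 'offset', 'offset_y']]
print(straddling)

# Blank out the rhythm columns for those rows, then drop the helper column.
cols = ['rhythm_name', 'rhythm_code']
dfm[cols] = dfm[cols].where(dfm['offset'] <= dfm['offset_y'])
result = dfm.drop(columns='offset_y')
```

Without the where step, the two straddling rows would silently inherit the rhythm of the segment that contains their onset, which is not what the question asks for.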

Answer 1: (score: 3)

If the data is not too large, you can use a broadcasting approach:

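import numpy as np

# Compare every df1 interval (rows) against every df2 interval (columns)
cond1 = df1.onset.values[:,None] >= df2.onset.values
cond2 = df1.offset.values[:,None] <= df2.offset.values

mask = (cond1&cond2)
# Position of the matching df2 row, or NaN when no df2 interval contains the row
idx = np.where(mask.any(axis=1), mask.argmax(axis=1), np.nan)

for col in ['rhythm_name', 'rhythm_code']:
    df1[col] = df2[col].reindex(idx).values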

Output:

    onset  offset rhythm_name  rhythm_code
0       1     200         NSR        100.0
1     201     400         NSR        100.0
2     401     600         NSR        100.0
3     601     800         NSR        100.0
4     801    1000         NSR        100.0
5    1001    1200         NSR        100.0
6    1201    1400         NSR        100.0
7    1401    1600         NSR        100.0
8    1601    1800         NSR        100.0
9    1801    2000         NSR        100.0
10   2001    2200         NSR        100.0
11   2201    2400         NSR        100.0
12   2401    2600         NSR        100.0
13   2601    2800         NaN          NaN
14   2801    3000  JUNCTIONAL       4000.0
15   3001    3200  JUNCTIONAL       4000.0
16   3201    3400  JUNCTIONAL       4000.0
17   3401    3600  JUNCTIONAL       4000.0
18   3601    3800  JUNCTIONAL       4000.0
19   3801    4000         NaN          NaN
20   4001    4200         NSR        100.0
21   4201    4400         NSR        100.0
22   4401    4600         NSR        100.0
23   4601    4800         NSR        100.0
24   4801    5000         NSR        100.0
25   5001    5200         NSR        100.0
26   5201    5400         NSR        100.0
27   5401    5600         NSR        100.0
28   5601    5800         NSR        100.0
29   5801    6000         NSR        100.0
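
As a rough illustration of what the broadcast produces, the sketch below builds the mask and idx for just three of the question's intervals (rows 12 to 14). The reduced df1 is an assumption made to keep the arrays small enough to read; the logic is meant to mirror the answer's code rather than replace it.

```python
import numpy as np
import pandas as pd

# Three intervals taken from the question (rows 12, 13 and 14 of df1).
df1 = pd.DataFrame({'onset': [2401, 2601, 2801],
                    'offset': [2600, 2800, 3000]})
df2 = pd.DataFrame({'onset': [1, 2761, 3939],
                    'offset': [2760, 3938, 6000],
                    'rhythm_name': ['NSR', 'JUNCTIONAL', 'NSR'],
                    'rhythm_code': [100, 4000, 100]})

# [:, None] turns the df1 values into a column vector, so the comparison
# broadcasts to shape (len(df1), len(df2)).
# mask[i, j] is True when df1 interval i lies fully inside df2 interval j.
mask = ((df1['onset'].to_numpy()[:, None] >= df2['onset'].to_numpy())
        & (df1['offset'].to_numpy()[:, None] <= df2['offset'].to_numpy()))

# argmax returns the first True column; rows with no True at all get NaN.
idx = np.where(mask.any(axis=1), mask.argmax(axis=1), np.nan)

print(mask.astype(int))
print(idx)
```

The mask holds one boolean per pair of rows, so its memory use grows with len(df1) * len(df2), which is why this approach only suits data that is not too large.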

Option 2: another (better) approach using merge_asof:

(pd.merge_asof(df1,df2,on='onset',direction='backward',suffixes=['','_y'])
   .query('offset<=offset_y')
   .reindex(df1.index)
   .drop('offset_y', axis=1)
   .fillna(df1)
)

You will get the same output.
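
For reference, a self-contained sketch of the Option 2 chain on the question's data, with each step commented. The final cast back to integers is my own addition, on the assumption that reindexing leaves onset and offset as floats once the dropped rows come back as NaN; the name out is illustrative.

```python
import pandas as pd

# Rebuild the question's data.
df1 = pd.DataFrame({'onset': [i * 200 + 1 for i in range(30)]})
df1['offset'] = df1['onset'] + 199
df2 = pd.DataFrame({'onset': [1, 2761, 3939],
                    'offset': [2760, 3938, 6000],
                    'rhythm_name': ['NSR', 'JUNCTIONAL', 'NSR'],
                    'rhythm_code': [100, 4000, 100]})

out = (pd.merge_asof(df1, df2, on='onset', direction='backward',
                     suffixes=('', '_y'))
         .query('offset <= offset_y')   # keep intervals fully inside a segment
         .reindex(df1.index)            # dropped rows return as all-NaN rows
         .drop(columns='offset_y')
         .fillna(df1))                  # restore onset/offset from df1

# Assumption: the integer columns became floats along the way, so cast back.
out[['onset', 'offset']] = out[['onset', 'offset']].astype('int64')
print(out.loc[12:14])
```

The rhythm columns stay NaN for the unmatched rows because df1 has no such columns for fillna to draw from.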