Suppose I have two DataFrames:
DATAFRAME 1
onset offset
0 1 200
1 201 400
2 401 600
3 601 800
4 801 1000
5 1001 1200
6 1201 1400
7 1401 1600
8 1601 1800
9 1801 2000
10 2001 2200
11 2201 2400
12 2401 2600
13 2601 2800
14 2801 3000
15 3001 3200
16 3201 3400
17 3401 3600
18 3601 3800
19 3801 4000
20 4001 4200
21 4201 4400
22 4401 4600
23 4601 4800
24 4801 5000
25 5001 5200
26 5201 5400
27 5401 5600
28 5601 5800
29 5801 6000
DATAFRAME 2
onset rhythm_name rhythm_code offset
0 1 NSR 100 2760
1 2761 JUNCTIONAL 4000 3938
2 3939 NSR 100 6000
My goal is to merge the two DataFrames on their onset/offset intervals, adding the matching rhythm_name and rhythm_code to each row, so that I get the following:
onset offset rhythm_name rhythm_code
0 1 200 NSR 100
1 201 400 NSR 100
2 401 600 NSR 100
3 601 800 NSR 100
4 801 1000 NSR 100
5 1001 1200 NSR 100
6 1201 1400 NSR 100
7 1401 1600 NSR 100
8 1601 1800 NSR 100
9 1801 2000 NSR 100
10 2001 2200 NSR 100
11 2201 2400 NSR 100
12 2401 2600 NSR 100
13 2601 2800 Null Null
14 2801 3000 JUNCTIONAL 4000
15 3001 3200 JUNCTIONAL 4000
16 3201 3400 JUNCTIONAL 4000
17 3401 3600 JUNCTIONAL 4000
18 3601 3800 JUNCTIONAL 4000
19 3801 4000 Null Null
20 4001 4200 NSR 100
21 4201 4400 NSR 100
22 4401 4600 NSR 100
23 4601 4800 NSR 100
24 4801 5000 NSR 100
25 5001 5200 NSR 100
26 5201 5400 NSR 100
27 5401 5600 NSR 100
28 5601 5800 NSR 100
29 5801 6000 NSR 100
How can I do this? I could not find a way to solve it. I have tried something like:
df1["rhythm_name"] = df2[(df1['onset'] >= df2['onset']) & (df1['offset'] <= df2['offset'])]
which gives me:
ValueError: Can only compare identically-labeled Series objects
I wrote a script to reproduce the problem:
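import pandas as pd

# df1: 30 consecutive intervals, each 200 units wide, starting at onset 1
df1 = pd.DataFrame()
onsets = []
for i in range(0, 30):
    onset = i * 200 + 1
    onsets.append(onset)
df1['onset'] = onsets
df1['offset'] = df1['onset'] + 200 - 1

# df2: rhythm annotations, each covering an onset..offset range
df2 = pd.DataFrame({'onset': [1, 2761, 3939],
                    'offset': [2760, 3938, 6000],
                    'rhythm_name': ["NSR", "JUNCTIONAL", "NSR"],
                    'rhythm_code': [100, 4000, 100]})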
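Answer 0 (score: 4)
You can use pd.merge_asof and then mask on the second condition:
# match each df1 row to the last df2 row whose onset is <= the df1 onset
dfm = pd.merge_asof(df1, df2, on='onset', direction='backward', suffixes=('','_y'))
# keep the rhythm columns only where the df1 interval also ends inside the df2 interval
dfm[['rhythm_name', 'rhythm_code']] = (dfm[['rhythm_name', 'rhythm_code']]
                                       .where(dfm['offset'] <= dfm['offset_y']))
dfm.drop('offset_y', axis=1)
Output: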
onset offset rhythm_name rhythm_code
0 1 200 NSR 100.0
1 201 400 NSR 100.0
2 401 600 NSR 100.0
3 601 800 NSR 100.0
4 801 1000 NSR 100.0
5 1001 1200 NSR 100.0
6 1201 1400 NSR 100.0
7 1401 1600 NSR 100.0
8 1601 1800 NSR 100.0
9 1801 2000 NSR 100.0
10 2001 2200 NSR 100.0
11 2201 2400 NSR 100.0
12 2401 2600 NSR 100.0
13 2601 2800 NaN NaN
14 2801 3000 JUNCTIONAL 4000.0
15 3001 3200 JUNCTIONAL 4000.0
16 3201 3400 JUNCTIONAL 4000.0
17 3401 3600 JUNCTIONAL 4000.0
18 3601 3800 JUNCTIONAL 4000.0
19 3801 4000 NaN NaN
20 4001 4200 NSR 100.0
21 4201 4400 NSR 100.0
22 4401 4600 NSR 100.0
23 4601 4800 NSR 100.0
24 4801 5000 NSR 100.0
25 5001 5200 NSR 100.0
26 5201 5400 NSR 100.0
27 5401 5600 NSR 100.0
28 5601 5800 NSR 100.0
29 5801 6000 NSR 100.0
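For anyone who wants to run the merge_asof approach end to end, here is a minimal self-contained sketch. It rebuilds the frames from the question and wraps the two steps in a helper; the name label_intervals is hypothetical and not part of pandas. It assumes both frames are already sorted by onset, which pd.merge_asof requires.

```python
import pandas as pd

# Rebuild the two frames from the question.
df1 = pd.DataFrame({'onset': [i * 200 + 1 for i in range(30)]})
df1['offset'] = df1['onset'] + 200 - 1
df2 = pd.DataFrame({'onset': [1, 2761, 3939],
                    'offset': [2760, 3938, 6000],
                    'rhythm_name': ['NSR', 'JUNCTIONAL', 'NSR'],
                    'rhythm_code': [100, 4000, 100]})


def label_intervals(left, right):  # hypothetical helper name
    """Attach rhythm columns to rows of `left` that lie fully inside a `right` interval."""
    # Step 1: for each left onset, take the last right row with onset <= left onset.
    merged = pd.merge_asof(left, right, on='onset',
                           direction='backward', suffixes=('', '_y'))
    # Step 2: keep the labels only where the left interval also ends in time.
    inside = merged['offset'] <= merged['offset_y']
    cols = ['rhythm_name', 'rhythm_code']
    merged[cols] = merged[cols].where(inside)
    return merged.drop(columns='offset_y')


result = label_intervals(df1, df2)
# Show the rows around the two boundaries between rhythms.
print(result.iloc[[12, 13, 14, 19, 20]])
```

rhythm_code comes back as float because the unmatched rows hold NaN, which a plain integer column cannot store.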
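Answer 1 (score: 3)
If the data is not too large, you can use a broadcasting approach:
import numpy as np

# compare every df1 interval (rows) against every df2 interval (columns)
cond1 = df1.onset.values[:,None] >= df2.onset.values
cond2 = df1.offset.values[:,None] <= df2.offset.values
mask = (cond1&cond2)
# position of the matching df2 row, or NaN when no df2 interval contains the df1 interval
idx = np.where(mask.any(axis=1), mask.argmax(axis=1), np.nan)
for col in ['rhythm_name', 'rhythm_code']:
    df1[col] = df2[col].reindex(idx).values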
Output:
onset offset rhythm_name rhythm_code
0 1 200 NSR 100.0
1 201 400 NSR 100.0
2 401 600 NSR 100.0
3 601 800 NSR 100.0
4 801 1000 NSR 100.0
5 1001 1200 NSR 100.0
6 1201 1400 NSR 100.0
7 1401 1600 NSR 100.0
8 1601 1800 NSR 100.0
9 1801 2000 NSR 100.0
10 2001 2200 NSR 100.0
11 2201 2400 NSR 100.0
12 2401 2600 NSR 100.0
13 2601 2800 NaN NaN
14 2801 3000 JUNCTIONAL 4000.0
15 3001 3200 JUNCTIONAL 4000.0
16 3201 3400 JUNCTIONAL 4000.0
17 3401 3600 JUNCTIONAL 4000.0
18 3601 3800 JUNCTIONAL 4000.0
19 3801 4000 NaN NaN
20 4001 4200 NSR 100.0
21 4201 4400 NSR 100.0
22 4401 4600 NSR 100.0
23 4601 4800 NSR 100.0
24 4801 5000 NSR 100.0
25 5001 5200 NSR 100.0
26 5201 5400 NSR 100.0
27 5401 5600 NSR 100.0
28 5601 5800 NSR 100.0
29 5801 6000 NSR 100.0
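The broadcasting idea can also be sketched as a standalone script. This is a variation under stated assumptions rather than the answer's exact code: it indexes df2 by position and blanks the unmatched rows with where instead of reindexing with NaN, and the names starts_after, ends_before, has_match, first_match and out are illustrative only. The mask has one cell per (df1 row, df2 row) pair, which is why this only suits data that is not too large.

```python
import numpy as np
import pandas as pd

# Rebuild the two frames from the question.
df1 = pd.DataFrame({'onset': [i * 200 + 1 for i in range(30)]})
df1['offset'] = df1['onset'] + 200 - 1
df2 = pd.DataFrame({'onset': [1, 2761, 3939],
                    'offset': [2760, 3938, 6000],
                    'rhythm_name': ['NSR', 'JUNCTIONAL', 'NSR'],
                    'rhythm_code': [100, 4000, 100]})

# A column vector against a row vector broadcasts to a 2-D boolean table.
starts_after = df1['onset'].to_numpy()[:, None] >= df2['onset'].to_numpy()
ends_before = df1['offset'].to_numpy()[:, None] <= df2['offset'].to_numpy()
mask = starts_after & ends_before      # shape: (len(df1), len(df2))

has_match = mask.any(axis=1)           # True where some df2 interval contains the row
first_match = mask.argmax(axis=1)      # first True per row; 0 when there is none

out = df1.copy()
for col in ['rhythm_name', 'rhythm_code']:
    picked = df2[col].to_numpy()[first_match]
    # argmax returns 0 for rows without a match, so blank those out explicitly.
    out[col] = pd.Series(picked, index=out.index).where(has_match)

print(mask.shape)
print(out.iloc[[12, 13, 14, 19, 20]])
```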
Option 2: another (better) approach, using merge_asof:
(pd.merge_asof(df1,df2,on='onset',direction='backward',suffixes=['','_y'])
   .query('offset<=offset_y')
   .reindex(df1.index)
   .drop('offset_y', axis=1)
   .fillna(df1)
)
You will get the same output.
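As a quick sanity check, the two merge_asof variants in this thread (masking with where versus filtering with query and then reindexing) can be compared directly. This is a small sketch, assuming the frames from the question; the names merged, a and b are illustrative. Only values are compared, because the query/reindex route turns onset and offset into floats along the way.

```python
import pandas as pd

# Rebuild the two frames from the question.
df1 = pd.DataFrame({'onset': [i * 200 + 1 for i in range(30)]})
df1['offset'] = df1['onset'] + 200 - 1
df2 = pd.DataFrame({'onset': [1, 2761, 3939],
                    'offset': [2760, 3938, 6000],
                    'rhythm_name': ['NSR', 'JUNCTIONAL', 'NSR'],
                    'rhythm_code': [100, 4000, 100]})

merged = pd.merge_asof(df1, df2, on='onset',
                       direction='backward', suffixes=('', '_y'))
cols = ['rhythm_name', 'rhythm_code']

# Variant A: keep every row, blank the labels where the second condition fails.
a = merged.copy()
a[cols] = a[cols].where(a['offset'] <= a['offset_y'])
a = a.drop(columns='offset_y')

# Variant B: drop the failing rows, bring them back as all-NaN rows,
# then refill onset/offset from df1.
b = (merged.query('offset <= offset_y')
           .reindex(df1.index)
           .drop(columns='offset_y')
           .fillna(df1))

# Compare values column by column, ignoring the int/float dtype difference.
same_labels = a['rhythm_name'].equals(b['rhythm_name'])
same_bounds = bool((a['onset'] == b['onset']).all()
                   and (a['offset'] == b['offset']).all())
print(same_labels, same_bounds)
print(a.dtypes, b.dtypes, sep='\n')
```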