I have two data frames that I want to merge.
df1.head()
subject age gender group
0 s1 23 M control
1 s2 21 F control
2 s3 48 F control
3 s4 59 F control
4 s5 47 M control
df2.head()
subject age gender group
43 s1 29 M migraine
45 s3 54 M migraine
46 s4 33 F migraine
47 s5 46 F migraine
48 s6 31 M migraine
The data frames should be merged on age and gender.
First, I split each data frame into subsets by gender.
This gave me four data frames: df1_M, df1_F, df2_M, and df2_F.
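For reference, the split by gender can be sketched as follows. The miniature `df1` is an assumption built from the rows that `df1.head()` prints above; the real frame has more rows.

```python
import pandas as pd

# Hypothetical miniature of df1, copied from the df1.head() output above.
df1 = pd.DataFrame({
    "subject": ["s1", "s2", "s3", "s4", "s5"],
    "age": [23, 21, 48, 59, 47],
    "gender": ["M", "F", "F", "F", "M"],
    "group": ["control"] * 5,
})

# A boolean mask on the gender column yields one frame per gender.
df1_M = df1[df1["gender"] == "M"]
df1_F = df1[df1["gender"] == "F"]

print(df1_M)
print(df1_F)
```

The same two lines applied to `df2` would give `df2_M` and `df2_F`.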
I tried pandas.merge_asof(), which worked for the males.
matches = pd.merge_asof(df2_M, df1_M, on='age', direction='nearest')
subject_x age gender_x group_x subject_y gender_y group_y
0 s15 25 M migraine s15 M control
1 s12 28 M migraine s32 M control
2 s1 29 M migraine s12 M control
3 s6 31 M migraine s24 M control
4 s68 42 M migraine s33 M control
5 s3 54 M migraine s14 M control
6 s8 67 M migraine s8 M control
That may just have been luck, because when I tried the same thing for the females, it gave me duplicates.
matches = pd.merge_asof(df1_F, df2_F, on='age', tolerance=2, direction='nearest')
subject_x age gender_x group_x subject_y gender_y group_y
0 s7 19 F control s51 F migraine
1 s2 21 F control s75 F migraine
2 s38 21 F control s75 F migraine
3 s9 21 F control s75 F migraine
4 s10 21 F control s75 F migraine
5 s13 21 F control s75 F migraine
6 s27 21 F control s75 F migraine
7 s26 21 F control s75 F migraine
8 s22 21 F control s75 F migraine
9 s31 22 F control s75 F migraine
10 s17 23 F control s71 F migraine
11 s40 25 F control s14 F migraine
12 s29 26 F control s14 F migraine
13 s43 26 F control s14 F migraine
14 s41 27 F control s20 F migraine
15 s19 29 F control s20 F migraine
16 s18 37 F control s13 F migraine
17 s20 47 F control s79 F migraine
18 s3 48 F control s37 F migraine
19 s11 55 F control s30 F migraine
20 s4 59 F control s91 F migraine
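As you can see, subject_y contains many duplicates, which should not happen.
I am trying to find the best pairing of subjects from the two data frames, with the smallest age difference.
Is there another way to do this?
Edit:
Sample dataset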
df1_M = pd.DataFrame({'subject':['s1','s5','s6','s8','s12','s14','s15','s16','s21','s23','s24','s25','s28','s30','s32','s33','s34','s35','s36','s37','s39','s42'],
'age':[23,47,25,60,30,54,25,56,47,35,31,21,19,23,27,46,25,24,25,19,26,24],
'gender':['M'] * 22,  # one entry per subject; the column lengths must match
'group':['control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control']})
df1_F = pd.DataFrame({'subject':['s2','s3','s4','s7','s9','s10','s11','s13','s17','s18','s19','s20','s22','s26','s27','s29','s31','s38','s40','s41','s43'],
'age':[21,48,59,19,21,21,55,21,23,37,29,47,21,21,21,26,22,21,25,27,26],
'gender':['F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F',],
'group':['control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control']})
df2_M = pd.DataFrame({'subject':['s1','s3','s6','s8','s12','s15','s68'],
'age':[29,54,31,67,28,25,42],
'gender':['M','M','M','M','M','M','M'],
'group':['migraine','migraine','migraine','migraine','migraine','migraine','migraine']})
df2_F = pd.DataFrame({'subject':['s4','s5','s7','s9','s11','s13','s14','s16','s17','s18','s19','s20','s21','s22','s24','s26',
's27','s30','s32','s33','s34','s35','s36','s37','s39','s41','s44','s45','s47','s49','s51',
's52','s55','s58','s59','s60','s61','s64','s65','s66','s67','s69','s70','s71','s72','s73',
's74','s75','s76','s77','s79','s80','s81','s82','s84','s85','s86','s87','s90','s91'],
'age':[33,46,41,51,33,37,24,58,37,54,41,29,37,34,52,35,46,55,54,35,43,63,56,48,24,46,51,42,50,52,18,
76,57,40,49,59,49,24,54,50,57,51,38,23,50,42,50,21,54,41,47,48,51,56,67,43,36,64,48,59],
'gender':['F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F',
'F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F'],
'group':['migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine',
'migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine',
'migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine',
'migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine',
'migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine',
'migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine']})
Answer 0 (score: 0)
df1_F contains several subjects aged 21, so the function matches subject s75 from df2_F to every subject in df1_F who is 21.
pd.merge_asof(df1_F, df2_F, on='age', by='subject', direction='nearest')
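Note that `by` requires an exact match on the listed columns before the nearest-key search, so with `by='subject'` a row can only pair with a row that carries the same subject label. If the goal is that no subject is used twice, one possible alternative is greedy nearest-neighbour matching without replacement. The sketch below is an assumption and not part of the original answer: `greedy_match`, its parameters, and the small frames are hypothetical, with subjects and ages taken from the sample data in the question.

```python
import pandas as pd


def greedy_match(left, right, key="age", tolerance=None):
    """Pair each row of `left` with the nearest unused row of `right`.

    Greedy and order-dependent: every right row is used at most once,
    but the total age difference is not guaranteed to be minimal.
    """
    available = right.reset_index(drop=True)  # unique labels so drop() is safe
    pairs = []
    # Stable sort so rows with equal ages keep their original order.
    for _, row in left.sort_values(key, kind="mergesort").iterrows():
        if available.empty:
            break
        diffs = (available[key] - row[key]).abs()
        best = diffs.idxmin()
        if tolerance is not None and diffs[best] > tolerance:
            continue  # no unused right row is close enough
        pairs.append({
            "subject_x": row["subject"],
            "age_x": row[key],
            "subject_y": available.at[best, "subject"],
            "age_y": available.at[best, key],
        })
        available = available.drop(index=best)  # remove the used row
    return pd.DataFrame(pairs)


# Hypothetical subsets of df1_F and df2_F from the question.
controls = pd.DataFrame({"subject": ["s2", "s9", "s10", "s17"],
                         "age": [21, 21, 21, 23]})
migraine = pd.DataFrame({"subject": ["s75", "s71", "s14", "s20"],
                         "age": [21, 23, 24, 29]})

matched = greedy_match(controls, migraine)
print(matched)
```

Because the result depends on the order in which rows are processed, this is only a heuristic. If the smallest total age difference is required, `scipy.optimize.linear_sum_assignment` applied to the matrix of absolute age differences solves that assignment problem.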