pandas merge_asof without duplicates

Date: 2019-11-04 15:23:25

Tags: python pandas

I have two data frames that I want to merge.

df1.head()

  subject  age gender    group
0      s1   23      M  control
1      s2   21      F  control
2      s3   48      F  control
3      s4   59      F  control
4      s5   47      M  control

df2.head()

   subject  age gender     group
43      s1   29      M  migraine
45      s3   54      M  migraine
46      s4   33      F  migraine
47      s5   46      F  migraine
48      s6   31      M  migraine

The data frames should be merged on age and gender.

First, I split each data frame by gender. That gave me four data frames: df1_M, df1_F, df2_M, and df2_F.
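For reference, the split described above can be sketched with boolean masks. This is only an illustration: the small df1 here is rebuilt from the df1.head() output shown earlier, not the full data set.

```python
import pandas as pd

# First five rows of df1, copied from the head() output in the question.
df1 = pd.DataFrame({
    'subject': ['s1', 's2', 's3', 's4', 's5'],
    'age': [23, 21, 48, 59, 47],
    'gender': ['M', 'F', 'F', 'F', 'M'],
    'group': ['control'] * 5,
})

# One subset per gender; the same two lines would be repeated for df2.
df1_M = df1[df1['gender'] == 'M']
df1_F = df1[df1['gender'] == 'F']

print(df1_M)
print(df1_F)
```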

I tried pandas.merge_asof(), which worked for the males.

matches = pd.merge_asof(df2_M, df1_M, on='age', direction='nearest')

  subject_x  age gender_x   group_x subject_y gender_y  group_y
0       s15   25        M  migraine       s15        M  control
1       s12   28        M  migraine       s32        M  control
2        s1   29        M  migraine       s12        M  control
3        s6   31        M  migraine       s24        M  control
4       s68   42        M  migraine       s33        M  control
5        s3   54        M  migraine       s14        M  control
6        s8   67        M  migraine        s8        M  control
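A note for anyone reproducing this: merge_asof requires both frames to be sorted on the key, and the sample frames in the edit further down are not sorted by age, so a sort is presumably needed first. A minimal sketch using a few rows taken from the sample data; the age_control column is my own addition so that the matched age stays visible in the result.

```python
import pandas as pd

# A few rows from the sample df2_M and df1_M (subject and age only).
df2_M = pd.DataFrame({'subject': ['s1', 's3', 's6'], 'age': [29, 54, 31]})
df1_M = pd.DataFrame({'subject': ['s12', 's14', 's24', 's32'],
                      'age': [30, 54, 31, 27]})

# merge_asof raises an error if the 'on' key is not sorted in both frames.
left = df2_M.sort_values('age')
# Keep a copy of the control age; the merged 'age' column comes from the left frame.
right = df1_M.sort_values('age').assign(age_control=lambda d: d['age'])

matches = pd.merge_asof(left, right, on='age', direction='nearest')
matches['age_diff'] = (matches['age'] - matches['age_control']).abs()
print(matches)
```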

That may just have been luck, because when I tried the same thing for the females, it gave me duplicates.

matches = pd.merge_asof(df1_F, df2_F, on='age', tolerance=2, direction='nearest')

   subject_x  age gender_x  group_x subject_y gender_y   group_y
0         s7   19        F  control       s51        F  migraine
1         s2   21        F  control       s75        F  migraine
2        s38   21        F  control       s75        F  migraine
3         s9   21        F  control       s75        F  migraine
4        s10   21        F  control       s75        F  migraine
5        s13   21        F  control       s75        F  migraine
6        s27   21        F  control       s75        F  migraine
7        s26   21        F  control       s75        F  migraine
8        s22   21        F  control       s75        F  migraine
9        s31   22        F  control       s75        F  migraine
10       s17   23        F  control       s71        F  migraine
11       s40   25        F  control       s14        F  migraine
12       s29   26        F  control       s14        F  migraine
13       s43   26        F  control       s14        F  migraine
14       s41   27        F  control       s20        F  migraine
15       s19   29        F  control       s20        F  migraine
16       s18   37        F  control       s13        F  migraine
17       s20   47        F  control       s79        F  migraine
18        s3   48        F  control       s37        F  migraine
19       s11   55        F  control       s30        F  migraine
20        s4   59        F  control       s91        F  migraine

As you can see, subject_y contains many duplicates, which should not happen.
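A quick way to quantify the reuse is to count how often each subject_y value appears. The frame below is a toy stand-in built from five rows of the output above, not the full result.

```python
import pandas as pd

# Five rows copied from the female merge_asof output (three columns only).
matches = pd.DataFrame({
    'subject_x': ['s7', 's2', 's38', 's9', 's17'],
    'age': [19, 21, 21, 21, 23],
    'subject_y': ['s51', 's75', 's75', 's75', 's71'],
})

# Number of control rows each migraine subject was assigned to.
reuse_counts = matches['subject_y'].value_counts()
reused = reuse_counts[reuse_counts > 1]
print(reused)

# All rows that share their subject_y with at least one other row.
dup_rows = matches[matches['subject_y'].duplicated(keep=False)]
print(dup_rows)
```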

I am trying to find the best pairing of subjects from the two data frames, i.e. the one with the smallest age differences.

Is there another way to do this?

Edit:

Sample data set

import pandas as pd

df1_M = pd.DataFrame({'subject':['s1','s5','s6','s8','s12','s14','s15','s16','s21','s23','s24','s25','s28','s30','s32','s33','s34','s35','s36','s37','s39','s42'],
                    'age':[23,47,25,60,30,54,25,56,47,35,31,21,19,23,27,46,25,24,25,19,26,24],
                    'gender':['M','M','M','M','M','M','M','M','M','M','M','M','M','M','M','M','M','M','M','M','M','M'],
                    'group':['control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control']})

df1_F = pd.DataFrame({'subject':['s2','s3','s4','s7','s9','s10','s11','s13','s17','s18','s19','s20','s22','s26','s27','s29','s31','s38','s40','s41','s43'],
                    'age':[21,48,59,19,21,21,55,21,23,37,29,47,21,21,21,26,22,21,25,27,26],
                    'gender':['F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F',],
                    'group':['control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control','control']})

df2_M = pd.DataFrame({'subject':['s1','s3','s6','s8','s12','s15','s68'],
                    'age':[29,54,31,67,28,25,42],
                    'gender':['M','M','M','M','M','M','M'],
                    'group':['migraine','migraine','migraine','migraine','migraine','migraine','migraine']})

df2_F = pd.DataFrame({'subject':['s4','s5','s7','s9','s11','s13','s14','s16','s17','s18','s19','s20','s21','s22','s24','s26',
                                's27','s30','s32','s33','s34','s35','s36','s37','s39','s41','s44','s45','s47','s49','s51',
                                's52','s55','s58','s59','s60','s61','s64','s65','s66','s67','s69','s70','s71','s72','s73',
                                's74','s75','s76','s77','s79','s80','s81','s82','s84','s85','s86','s87','s90','s91'],
                    'age':[33,46,41,51,33,37,24,58,37,54,41,29,37,34,52,35,46,55,54,35,43,63,56,48,24,46,51,42,50,52,18,
                        76,57,40,49,59,49,24,54,50,57,51,38,23,50,42,50,21,54,41,47,48,51,56,67,43,36,64,48,59],
                    'gender':['F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F',
                            'F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F','F'],
                    'group':['migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine',
                        'migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine',
                        'migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine',
                        'migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine',
                        'migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine',
                        'migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine','migraine']})

1 answer:

Answer 0: (score: 0)

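It gets duplicated because you left it as (df1_F, df2_F, on='age')

i.e. you are matching on age only.

df1_F contains several subjects aged 21, so the function matches subject s75 from df2_F to every subject in df1_F who is 21.

pd.merge_asof(df1_F, df2_F, on=['age'], by=['subject'], direction='nearest')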
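One caveat on the call above: as far as I can tell, by=['subject'] makes merge_asof pair only rows whose subject labels are equal in both frames, and merge_asof has no option to use each right-hand row at most once. If the goal is a one-to-one pairing, one alternative is to do the matching by hand. The sketch below is a greedy approach under my own assumptions, not part of pandas: greedy_age_match and max_diff are hypothetical names, and the ages are illustrative rows loosely based on the question's data. Greedy matching takes the closest pairs first, so it does not guarantee the smallest total age difference.

```python
import numpy as np
import pandas as pd

def greedy_age_match(left, right, max_diff=None):
    """Pair each row of `left` with at most one row of `right`, closest ages first.

    Hypothetical helper: expects 'subject' and 'age' columns in both frames.
    """
    left = left.reset_index(drop=True)
    right = right.reset_index(drop=True)
    # Absolute age difference for every (left, right) pair.
    diff = np.abs(left['age'].to_numpy()[:, None] - right['age'].to_numpy()[None, :])
    # Visit candidate pairs from the smallest to the largest age difference.
    order = np.argsort(diff, axis=None, kind='stable')
    used_left, used_right, pairs = set(), set(), []
    for flat in order:
        i, j = divmod(int(flat), diff.shape[1])
        if max_diff is not None and diff[i, j] > max_diff:
            break  # every remaining pair is at least this far apart
        if i in used_left or j in used_right:
            continue  # one of the two subjects is already paired
        used_left.add(i)
        used_right.add(j)
        pairs.append((left.at[i, 'subject'], left.at[i, 'age'],
                      right.at[j, 'subject'], right.at[j, 'age']))
    return pd.DataFrame(pairs, columns=['subject_x', 'age_x', 'subject_y', 'age_y'])

# Illustrative rows: three controls aged 21 compete for one migraine subject aged 21.
controls = pd.DataFrame({'subject': ['s2', 's9', 's10', 's3'],
                         'age': [21, 21, 21, 48]})
migraine = pd.DataFrame({'subject': ['s75', 's71', 's14', 's37'],
                         'age': [21, 23, 24, 48]})

matches = greedy_age_match(controls, migraine, max_diff=5)
print(matches)
```

Here max_diff plays a role similar to tolerance in merge_asof: pairs further apart than that are left unmatched. If a globally optimal pairing is needed, an assignment solver such as scipy.optimize.linear_sum_assignment on the same difference matrix may be worth a look.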