Find the intersection of two DataFrames based on columns

Time: 2019-12-12 09:05:41

Tags: python pandas dataframe

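Suppose I have the two DataFrames shown below, df1 and df2.

I want to find the intersection of these two DataFrames based on the "Start" and "Gene_Symbol" columns: a row of df1 should be kept only if its "Start" and "Gene_Symbol" values match a row in df2. The two DataFrames are: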

df1:
    Composite   Beta_value  Chromosome  Start       End     Gene_Symbol
0   cg00000029  0.297449111 chr16       53434200    53434201    RBL2
1   cg00000108  0.660066803 chr3        37417715    37417716    C3orf35
2   cg00000109  0.660066803 chr3        172198247   172198248   FNDC3B
3   cg00000165  0.660066803 chr1        90729117    90729118    C3orf35
4   cg00000236  0.905679244 chr8        42405776    42405777    VDAC3



df2:     
    Composite   Beta_value  Chromosome  Start       End     Gene_Symbol
2   cg00000109  0.660066803 chr3        172198247   172198248   FNDC3B
3   cg00000165  0.660066803 chr1        90729117    90729118    C3orf35
4   cg00000236  0.905679244 chr8        42405776    42405777    VDAC3
46  cg00002116  0.017114732 chr17       81703380    81703381    MRPL12
47  cg00002145  0.780230816 chr2        237340893   237340894   COL6A3
48  cg00002190  0.781140134 chr8        19697522    19697523    CSGALNACT1
49  cg00002224  0.220786047 chr8        143038982   143038983   C8orf31

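By intersection, I do not mean merging the DataFrames and ending up with 12 columns. For example, I would like my result to look like this: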

    Composite   Beta_value  Chromosome  Start       End     Gene_Symbol
2   cg00000109  0.660066803 chr3        172198247   172198248   FNDC3B
3   cg00000165  0.660066803 chr1        90729117    90729118    C3orf35
4   cg00000236  0.905679244 chr8        42405776    42405777    VDAC3

A plain merge such as the following is not what I want, because it combines the columns of both DataFrames:

intersection = pd.merge(df1, df2, how='inner', on=['Start','Gene_Symbol'])

2 Answers:

Answer 0 (score: 1)

When using DataFrame.merge, select only the key columns from df2, so that the remaining columns of df2 are not merged into the result:

keys = ['Start', 'Gene_Symbol']
intersection = df1.merge(df2[keys], on=keys)
    Composite  Beta_value Chromosome      Start        End Gene_Symbol
0  cg00000109    0.660067       chr3  172198247  172198248      FNDC3B
1  cg00000165    0.660067       chr1   90729117   90729118     C3orf35
2  cg00000236    0.905679       chr8   42405776   42405777       VDAC3
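
For reference, here is a self-contained sketch of the same approach, built from a few rows of the question's data. The `drop_duplicates()` call is an addition that is not part of the answer above: it assumes df2 could contain repeated (Start, Gene_Symbol) pairs, which would otherwise repeat the matching rows of df1 in the result.

```python
import pandas as pd

# Sample rows copied from the question (df2 shortened to four rows).
df1 = pd.DataFrame({
    "Composite": ["cg00000029", "cg00000108", "cg00000109", "cg00000165", "cg00000236"],
    "Beta_value": [0.297449111, 0.660066803, 0.660066803, 0.660066803, 0.905679244],
    "Chromosome": ["chr16", "chr3", "chr3", "chr1", "chr8"],
    "Start": [53434200, 37417715, 172198247, 90729117, 42405776],
    "End": [53434201, 37417716, 172198248, 90729118, 42405777],
    "Gene_Symbol": ["RBL2", "C3orf35", "FNDC3B", "C3orf35", "VDAC3"],
})
df2 = pd.DataFrame({
    "Composite": ["cg00000109", "cg00000165", "cg00000236", "cg00002116"],
    "Beta_value": [0.660066803, 0.660066803, 0.905679244, 0.017114732],
    "Chromosome": ["chr3", "chr1", "chr8", "chr17"],
    "Start": [172198247, 90729117, 42405776, 81703380],
    "End": [172198248, 90729118, 42405777, 81703381],
    "Gene_Symbol": ["FNDC3B", "C3orf35", "VDAC3", "MRPL12"],
})

keys = ["Start", "Gene_Symbol"]

# Keep only the key columns of df2; drop repeated key pairs (an assumption,
# see the note above) so each df1 row can match at most once.
right = df2[keys].drop_duplicates()

# Inner join on the keys: only df1's columns remain in the result.
intersection = df1.merge(right, on=keys)
print(intersection)
```

Note that the merge renumbers the result rows from 0, so the original row labels of df1 are not kept.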

Answer 1 (score: 1)

Use only the required columns from df2.

pd.merge(df1, df2[['Start','Gene_Symbol']], on=['Start','Gene_Symbol'])
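
If the original row labels of df1 should be preserved (the expected result in the question is labelled 2, 3, 4), one possible alternative is to filter df1 with a boolean mask instead of merging. This is a sketch of a different technique from the two answers, not something they propose; it uses `pd.MultiIndex.from_frame` and `MultiIndex.isin`, and the data is a trimmed copy of the question's rows.

```python
import pandas as pd

# Trimmed copies of the question's data (only the columns needed here).
df1 = pd.DataFrame({
    "Composite": ["cg00000029", "cg00000108", "cg00000109", "cg00000165", "cg00000236"],
    "Start": [53434200, 37417715, 172198247, 90729117, 42405776],
    "Gene_Symbol": ["RBL2", "C3orf35", "FNDC3B", "C3orf35", "VDAC3"],
})
df2 = pd.DataFrame({
    "Composite": ["cg00000109", "cg00000165", "cg00000236", "cg00002116"],
    "Start": [172198247, 90729117, 42405776, 81703380],
    "Gene_Symbol": ["FNDC3B", "C3orf35", "VDAC3", "MRPL12"],
})

keys = ["Start", "Gene_Symbol"]

# Build one MultiIndex of (Start, Gene_Symbol) pairs per frame.
pairs1 = pd.MultiIndex.from_frame(df1[keys])
pairs2 = pd.MultiIndex.from_frame(df2[keys])

# Boolean mask: True where df1's key pair also occurs in df2.
mask = pairs1.isin(pairs2)

# Filtering keeps df1's columns and its original row labels.
result = df1[mask]
print(result)
```

Because this only filters df1, repeated key pairs in df2 cannot duplicate rows in the result.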