Merging and removing duplicates

Time: 2017-09-29 11:33:20

Tags: python pandas dataframe merge

I have two large datasets that I want to merge. They share a common column, "gene".

All entries in df1 are unique

In [85]: df1
Out[85]: 
         gene
0       Cdk12
1     Cdk2ap1
2        Cdk7
3        Cdk8
4        Cdx2
5       Cenpa
6       Cenpa
7       Cenpa
8      Cenpc1
9       Cenpe
10      Cenpj

In [86]: df2
Out[86]: 
           gene  year                           DOI
0         Cdk12  2001              10.1038/35055500
1         Cdk12  2002           10.1038/nature01266
2         Cdk12  2002        10.1074/jbc.M106813200
3         Cdk12  2003       10.1073/pnas.1633296100
4         Cdk12  2003       10.1073/pnas.2336103100
5         Cdk12  2005            10.1093/nar/gni045
6         Cdk12  2005       10.1126/science.1112014
7         Cdk12  2008         10.1101/gr.078352.108
8         Cdk12  2011  10.1371/journal.pbio.1000582
9         Cdk12  2012       10.1074/jbc.M111.321760
10        Cdk12  2016          10.1038/cdd.2015.157
11        Cdk12  2017         10.1093/cercor/bhw081
12      Cdk2ap1  2001        10.1006/geno.2001.6474
13      Cdk2ap1  2001              10.1038/35055500
14      Cdk2ap1  2002           10.1038/nature01266

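I want to keep the order of df1, because I will be joining it with other datasets later.

df2 has many entries for each "gene", and I want only one per gene.

The most recent value in "year" should determine which "gene" entry is kept.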

What I have tried: reading the files into pandas, then naming the columns

import pandas as pd

df1 = pd.read_csv('T1inorderforMerge.csv', header=None)
df2 = pd.read_csv('T2inorderforMerge.csv', header=None)
df1.columns = ["gene"]
df2.columns = ["gene", "year", "DOI"]

I have tried every variation of the code below, i.e. changing the merge method (how) and the order of the DataFrames.

df3 = pd.merge(df1, df2, on ="gene", how="left")

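I also tried stacking vertically and horizontally, which may look obviously wrong to some, and it did not work. I tried a lot of other messy code as well, but I would really like to know how to do this with pandas.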

2 Answers:

Answer 0: (score: 3)

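I think one possible solution is to create a helper column that counts the occurrences of each gene value, and then merge on pairs: the first Cdk12 in df1 with the first Cdk12 in df2, the second Cdk12 with the second Cdk12, and so on. Unique values are merged in the classic 1-to-1 way (because a is always 0):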

df1['a'] = df1.groupby('gene').cumcount()
df2['a'] = df2.groupby('gene').cumcount()

print (df1)
       gene  a
0     Cdk12  0
1   Cdk2ap1  0
2      Cdk7  0
3      Cdk8  0
4      Cdx2  0
5     Cenpa  0
6     Cenpa  1
7     Cenpa  2
8    Cenpc1  0
9     Cenpe  0
10    Cenpj  0

print (df2)
       gene  year                           DOI   a
0     Cdk12  2001              10.1038/35055500   0
1     Cdk12  2002           10.1038/nature01266   1
2     Cdk12  2002        10.1074/jbc.M106813200   2
3     Cdk12  2003       10.1073/pnas.1633296100   3
4     Cdk12  2003       10.1073/pnas.2336103100   4
5     Cdk12  2005            10.1093/nar/gni045   5
6     Cdk12  2005       10.1126/science.1112014   6
7     Cdk12  2008         10.1101/gr.078352.108   7
8     Cdk12  2011  10.1371/journal.pbio.1000582   8
9     Cdk12  2012       10.1074/jbc.M111.321760   9
10    Cdk12  2016          10.1038/cdd.2015.157  10
11    Cdk12  2017         10.1093/cercor/bhw081  11
12  Cdk2ap1  2001        10.1006/geno.2001.6474   0
13  Cdk2ap1  2001              10.1038/35055500   1
14  Cdk2ap1  2002           10.1038/nature01266   2
df3 = pd.merge(df1, df2, on =["a","gene"], how="left").drop('a', axis=1)
print (df3)
       gene    year                     DOI
0     Cdk12  2001.0        10.1038/35055500
1   Cdk2ap1  2001.0  10.1006/geno.2001.6474
2      Cdk7     NaN                     NaN
3      Cdk8     NaN                     NaN
4      Cdx2     NaN                     NaN
5     Cenpa     NaN                     NaN
6     Cenpa     NaN                     NaN
7     Cenpa     NaN                     NaN
8    Cenpc1     NaN                     NaN
9     Cenpe     NaN                     NaN
10    Cenpj     NaN                     NaN

Rows of df1 whose (gene, a) pair has no match in df2 get NaN.

However, if you only need to work with the unique values in df1['gene'], first call drop_duplicates on both DataFrames:

df1 = df1.drop_duplicates('gene')
df2 = df2.drop_duplicates('gene')

print (df1)
      gene
0     Cdk12
1   Cdk2ap1
2      Cdk7
3      Cdk8
4      Cdx2
5     Cenpa
8    Cenpc1
9     Cenpe
10    Cenpj

print (df2)
       gene  year                     DOI
0     Cdk12  2001        10.1038/35055500
12  Cdk2ap1  2001  10.1006/geno.2001.6474

df3 = pd.merge(df1, df2, on="gene", how="left")
print (df3)
      gene    year                     DOI
0    Cdk12  2001.0        10.1038/35055500
1  Cdk2ap1  2001.0  10.1006/geno.2001.6474
2     Cdk7     NaN                     NaN
3     Cdk8     NaN                     NaN
4     Cdx2     NaN                     NaN
5    Cenpa     NaN                     NaN
6   Cenpc1     NaN                     NaN
7    Cenpe     NaN                     NaN
8    Cenpj     NaN                     NaN
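
Note that drop_duplicates keeps the first row per gene by default, which in the sample df2 is the earliest year. The question asks for the latest year, so here is a minimal sketch of one way to get that, assuming "year" is numeric and that the small frames below (a few rows copied from the post) stand in for the real CSV data. The name `latest` is only illustrative.

```python
import pandas as pd

# Small stand-ins for the frames in the question (values copied from the post).
df1 = pd.DataFrame({"gene": ["Cdk12", "Cdk2ap1", "Cdk7", "Cenpa", "Cenpa"]})
df2 = pd.DataFrame({
    "gene": ["Cdk12", "Cdk12", "Cdk12", "Cdk2ap1", "Cdk2ap1", "Cdk2ap1"],
    "year": [2001, 2016, 2017, 2001, 2001, 2002],
    "DOI": [
        "10.1038/35055500", "10.1038/cdd.2015.157", "10.1093/cercor/bhw081",
        "10.1006/geno.2001.6474", "10.1038/35055500", "10.1038/nature01266",
    ],
})

# Sort by year (stable sort), then keep the last row per gene,
# i.e. the row with the most recent year.
latest = df2.sort_values("year", kind="mergesort").drop_duplicates("gene", keep="last")

# A left merge keeps every row of df1, in df1's original order.
df3 = df1.merge(latest, on="gene", how="left")
print(df3)
```

If a gene has several rows with the same latest year, this keeps whichever of them comes last in df2. Add a second sort key if a specific tie-break is needed.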

Answer 1: (score: 1)

Not sure what type(df1) is, but:

In [1]: df1 = ['a', 'f', 'g']

In [2]: df2 = [['a', 7, True], ['g',8, False]]

In [3]: [[inner_item for inner_item in df2 if inner_item[0] == outer_item][0] if len([inner_item for inner_item in df2 if inner_item[0] == outer_item])>0 else [outer_item,None,None] for outer_item in df1]

Out[3]: [['a', 7, True], ['f', None, None], ['g', 8, False]]
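
The same idea for plain lists can be sketched more readably with a dictionary lookup. This is only an illustration, assuming the same list-of-lists layout as above. The names `lookup` and `result` are hypothetical.

```python
df1 = ['a', 'f', 'g']
df2 = [['a', 7, True], ['g', 8, False]]

# Index the rows of df2 by their first element. setdefault keeps the
# first row seen for each key, like the [0] in the comprehension above.
lookup = {}
for row in df2:
    lookup.setdefault(row[0], row)

# Walk df1 in order; keys missing from df2 get a placeholder row.
result = [lookup.get(key, [key, None, None]) for key in df1]
print(result)
```

This builds the index once instead of rescanning df2 for every item in df1.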