I have two large datasets that I want to merge; they share a common column, "gene".
All entries in df1 are unique.
In [85]: df1
Out[85]:
gene
0 Cdk12
1 Cdk2ap1
2 Cdk7
3 Cdk8
4 Cdx2
5 Cenpa
6 Cenpa
7 Cenpa
8 Cenpc1
9 Cenpe
10 Cenpj
In [86]: df2
Out[86]:
gene year DOI
0 Cdk12 2001 10.1038/35055500
1 Cdk12 2002 10.1038/nature01266
2 Cdk12 2002 10.1074/jbc.M106813200
3 Cdk12 2003 10.1073/pnas.1633296100
4 Cdk12 2003 10.1073/pnas.2336103100
5 Cdk12 2005 10.1093/nar/gni045
6 Cdk12 2005 10.1126/science.1112014
7 Cdk12 2008 10.1101/gr.078352.108
8 Cdk12 2011 10.1371/journal.pbio.1000582
9 Cdk12 2012 10.1074/jbc.M111.321760
10 Cdk12 2016 10.1038/cdd.2015.157
11 Cdk12 2017 10.1093/cercor/bhw081
12 Cdk2ap1 2001 10.1006/geno.2001.6474
13 Cdk2ap1 2001 10.1038/35055500
14 Cdk2ap1 2002 10.1038/nature01266
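I want to keep the order of df1, because I will be joining it with other datasets afterwards.
df2 has many entries for each gene, and I want only one per gene.
The most recent value in "year" should determine which entry is kept for each gene.
What I have tried: reading the files into pandas and then naming the columns: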
import pandas as pd

# The CSV files have no header row, so the column names are assigned afterwards
df1 = pd.read_csv('T1inorderforMerge.csv', header=None)
df2 = pd.read_csv('T2inorderforMerge.csv', header=None)
df1.columns = ["gene"]
df2.columns = ["gene", "year", "DOI"]
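I have tried every variation of the code below, changing the how argument and the order of the DataFrames.
df3 = pd.merge(df1, df2, on="gene", how="left")
I also tried stacking the data vertically and horizontally, which may seem the obvious approach to some, but it did not work. I tried a lot of other messy code as well, but I would really like to know how to do this properly with pandas.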
Answer 0: (score: 3)
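I think one possible solution is to create a helper column that counts the occurrences of each gene value, and then merge on the pair of columns: the first Cdk12 in df1 is merged with the first Cdk12 in df2, the second Cdk12 with the second Cdk12, and so on. Unique values are merged one-to-one in the usual way, because a is always 0 for them: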
df1['a'] = df1.groupby('gene').cumcount()
df2['a'] = df2.groupby('gene').cumcount()
print (df1)
gene a
0 Cdk12 0
1 Cdk2ap1 0
2 Cdk7 0
3 Cdk8 0
4 Cdx2 0
5 Cenpa 0
6 Cenpa 1
7 Cenpa 2
8 Cenpc1 0
9 Cenpe 0
10 Cenpj 0
print (df2)
gene year DOI a
0 Cdk12 2001 10.1038/35055500 0
1 Cdk12 2002 10.1038/nature01266 1
2 Cdk12 2002 10.1074/jbc.M106813200 2
3 Cdk12 2003 10.1073/pnas.1633296100 3
4 Cdk12 2003 10.1073/pnas.2336103100 4
5 Cdk12 2005 10.1093/nar/gni045 5
6 Cdk12 2005 10.1126/science.1112014 6
7 Cdk12 2008 10.1101/gr.078352.108 7
8 Cdk12 2011 10.1371/journal.pbio.1000582 8
9 Cdk12 2012 10.1074/jbc.M111.321760 9
10 Cdk12 2016 10.1038/cdd.2015.157 10
11 Cdk12 2017 10.1093/cercor/bhw081 11
12 Cdk2ap1 2001 10.1006/geno.2001.6474 0
13 Cdk2ap1 2001 10.1038/35055500 1
14 Cdk2ap1 2002 10.1038/nature01266 2
df3 = pd.merge(df1, df2, on =["a","gene"], how="left").drop('a', axis=1)
print (df3)
gene year DOI
0 Cdk12 2001.0 10.1038/35055500
1 Cdk2ap1 2001.0 10.1006/geno.2001.6474
2 Cdk7 NaN NaN
3 Cdk8 NaN NaN
4 Cdx2 NaN NaN
5 Cenpa NaN NaN
6 Cenpa NaN NaN
7 Cenpa NaN NaN
8 Cenpc1 NaN NaN
9 Cenpe NaN NaN
10 Cenpj NaN NaN
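Every row whose (a, gene) pair has no match in df2 gets NaN.
But if you only need to work with the unique values in df1['gene'], first apply drop_duplicates to both DataFrames (by default it keeps the first row for each gene):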
df1 = df1.drop_duplicates('gene')
df2 = df2.drop_duplicates('gene')
print (df1)
gene
0 Cdk12
1 Cdk2ap1
2 Cdk7
3 Cdk8
4 Cdx2
5 Cenpa
8 Cenpc1
9 Cenpe
10 Cenpj
print (df2)
gene year DOI
0 Cdk12 2001 10.1038/35055500
12 Cdk2ap1 2001 10.1006/geno.2001.6474
df3 = pd.merge(df1, df2, on ="gene", how="left")
print (df3)
gene year DOI
0 Cdk12 2001.0 10.1038/35055500
1 Cdk2ap1 2001.0 10.1006/geno.2001.6474
2 Cdk7 NaN NaN
3 Cdk8 NaN NaN
4 Cdx2 NaN NaN
5 Cenpa NaN NaN
6 Cenpc1 NaN NaN
7 Cenpe NaN NaN
8 Cenpj NaN NaN
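The question also asks for the entry with the latest year per gene, whereas drop_duplicates('gene') keeps the first row of each gene. The following is a minimal sketch of one way to do that, assuming year is numeric. The small frames are abbreviated copies of the sample data, and the name latest is chosen here only for illustration:

```python
import pandas as pd

# Abbreviated copies of the sample data from the question
df1 = pd.DataFrame({"gene": ["Cdk12", "Cdk2ap1", "Cdk7"]})
df2 = pd.DataFrame({
    "gene": ["Cdk12", "Cdk12", "Cdk12", "Cdk2ap1", "Cdk2ap1"],
    "year": [2001, 2016, 2017, 2001, 2002],
    "DOI": [
        "10.1038/35055500",
        "10.1038/cdd.2015.157",
        "10.1093/cercor/bhw081",
        "10.1006/geno.2001.6474",
        "10.1038/nature01266",
    ],
})

# Sort by year so the newest entry of each gene comes last, then keep that row.
# mergesort is stable, so rows with the same year keep their original order.
latest = df2.sort_values("year", kind="mergesort").drop_duplicates("gene", keep="last")

# A left merge keeps the row order of df1; genes missing from df2 get NaN.
df3 = pd.merge(df1, latest, on="gene", how="left")
print(df3)
```

If several entries share the latest year, this keeps the one that appears last in df2. Any other tie-breaking rule would need an extra sort key.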
Answer 1: (score: 1)
Not sure what type(df1) is, but:
In [1]: df1 = ['a', 'f', 'g']
In [2]: df2 = [['a', 7, True], ['g',8, False]]
In [3]: [[inner_item for inner_item in df2 if inner_item[0] == outer_item][0] if len([inner_item for inner_item in df2 if inner_item[0] == outer_item])>0 else [outer_item,None,None] for outer_item in df1]
Out[3]: [['a', 7, True], ['f', None, None], ['g', 8, False]]
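The same lookup on plain lists can be written more readably with a dictionary. This is only a sketch under the same assumption as above, namely that df1 and df2 are plain Python lists. The names lookup and merged are hypothetical:

```python
df1 = ["a", "f", "g"]
df2 = [["a", 7, True], ["g", 8, False]]

# Index the rows of df2 by their first element.
# setdefault keeps the first row seen for each key, like the [0] in the comprehension.
lookup = {}
for row in df2:
    lookup.setdefault(row[0], row)

# Walk df1 in order; keys missing from df2 are padded with None.
merged = [lookup.get(key, [key, None, None]) for key in df1]
print(merged)
```

This scans df2 once instead of once per element of df1, which matters when both lists are large.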