Merging on three columns across multiple files using pandas DataFrame

Time: 2017-04-12 19:14:37

Tags: python dataframe merge

I want to use pd.merge to merge roughly 10 files, each of which contains data like this:

chrom   start   end name    score   strand  splice_site acceptors_skipped   exons_skipped   donors_skipped  anchor  known_donor known_acceptor  known_junction  genes   transcripts
4      3487839 3491240  JUNC00148541    101 -   GT-AG   2   1   3   DA  1   1   1   Tmem68  ENSMUST00000029891,ENSMUST00000108388,ENSMUST00000154922
4      3489293 3491240  JUNC00148543    1   -   GT-AG   1   0   1   DA  1   1   1   Tmem68  ENSMUST00000029891,ENSMUST00000108388,ENSMUST00000154922

In the past I have used pd.merge(df_a, df_b, on='gene', how='outer') to merge on a single column. Here I want to merge on chrom, start, end, and strand.

I would like my new df to look like this:

chrom:start-end(strand) score_file1 score_file2 ...file10 gene_name splice_site acceptors_skipped exons_skipped donors_skipped...transcripts

With how='outer', I assume NaN values will be filled in wherever there is no match. What is the best way to reduce memory usage?

import glob
import os

import pandas as pd

path = r'/Users/PycharmProjects/'
all_files = glob.glob(os.path.join(path, "*_bed.txt"))
print(all_files)
df1 = pd.read_table(all_files[0])
df2 = pd.read_table(all_files[1])

concatenated_df = pd.merge(df1, df2, on=['genes', 'chrom', 'start', 'end'], how='outer')
print(concatenated_df.head(n=5))

Any help is appreciated!

Update, simplified question:

chr start end score strand gene
1   20    30  50    -      abc1
2   40    50  50    +      cdf1

There are 10 CSV files with data like this. I want to merge them on chr, start, end, and gene (exact match). The new df:

chr start end score_file1 score_file2..file10 strand gene
1   20    30  50  20 40   -      abc1
2   40    50  50  30 50   +      cdf1

1 answer:

Answer 0 (score: 0)

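from functools import reduce  # reduce is not a builtin in Python 3

import pandas as pd

# Keep only the key columns and score from each DataFrame to be merged.
cols = ['chr', 'gene', 'start', 'end', 'score']
dfs = [df1[cols], df2[cols], df3[cols], df10[cols]]
# Outer-merge pairwise on the four key columns; unmatched rows get NaN.
# Every frame has a 'score' column, so renaming it per file first avoids suffix clashes.
df_final = reduce(lambda left, right: pd.merge(left, right,
                  on=['gene', 'chr', 'start', 'end'], how='outer'), dfs)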
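
One thing to watch with the reduce approach is that every frame carries a column named score, so pandas has to disambiguate them with suffixes, and over many merges the names can end up ambiguous or duplicated. Below is a minimal sketch of one way around that, assuming the simplified layout from the question (chr, start, end, score, strand, gene, whitespace-separated). The helper names load_scores and merge_scores, the score_fileN labels, and the third sample file are made up for illustration and are not from the original post. Passing usecols to the reader also keeps unneeded columns out of memory, which may help with the memory concern.

```python
import io
from functools import reduce

import pandas as pd

# Columns that must match exactly for two rows to be merged.
KEYS = ["chr", "start", "end", "strand", "gene"]


def load_scores(source, label):
    """Read one file, keeping only the key columns plus a uniquely named score."""
    df = pd.read_csv(source, sep=r"\s+", usecols=KEYS + ["score"])
    return df.rename(columns={"score": f"score_{label}"})


def merge_scores(frames):
    """Outer-merge all frames on KEYS; rows missing from a file get NaN."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=KEYS, how="outer"),
        frames,
    )


# Hypothetical in-memory stand-ins for the real files.
file1 = io.StringIO(
    "chr start end score strand gene\n1 20 30 50 - abc1\n2 40 50 50 + cdf1\n"
)
file2 = io.StringIO(
    "chr start end score strand gene\n1 20 30 20 - abc1\n2 40 50 30 + cdf1\n"
)
file3 = io.StringIO(
    "chr start end score strand gene\n1 20 30 40 - abc1\n3 60 70 10 + xyz1\n"
)

frames = [
    load_scores(f, f"file{i}")
    for i, f in enumerate([file1, file2, file3], start=1)
]
df_final = merge_scores(frames)
print(df_final)
```

For real data, the StringIO objects would be replaced by the paths returned from glob.glob, with a label derived from each file name. If the files are tab-separated rather than whitespace-separated, sep would need to change accordingly.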