I want to use pd.merge to combine about 10 files. Each file has data like this:
chrom start end name score strand splice_site acceptors_skipped exons_skipped donors_skipped anchor known_donor known_acceptor known_junction genes transcripts
4 3487839 3491240 JUNC00148541 101 - GT-AG 2 1 3 DA 1 1 1 Tmem68 ENSMUST00000029891,ENSMUST00000108388,ENSMUST00000154922
4 3489293 3491240 JUNC00148543 1 - GT-AG 1 0 1 DA 1 1 1 Tmem68 ENSMUST00000029891,ENSMUST00000108388,ENSMUST00000154922
In the past I have used pd.merge(df_a, df_b, on='gene', how='outer')
to merge on a single column. Here I want to merge on chrom, start, end and strand.
My new df would look like:
chrm:start-end(strand) score_file1 score_file2 ...file10 gene_name splice_site acceptores exon_skipped donors_skipped...transcripts
Where a row has no match, I assume how='outer' will fill in NaN values.
What is the best way to reduce memory usage?
import glob
import os
import pandas as pd

path = r'/Users/PycharmProjects/'
all_files = glob.glob(os.path.join(path, "*_bed.txt"))
print(all_files)
df1 = pd.read_table(all_files[0])
df2 = pd.read_table(all_files[1])
concatnated_df = pd.merge(df1, df2, on=['genes', 'chrom', 'start', 'end'], how='outer')
print(concatnated_df.head(n=5))
Any help is appreciated!
Update, simplified question:
chr start end score strand gene
1 20 30 50 - abc1
2 40 50 50 + cdf1
There are 10 csv files with this kind of data. Merge them on chr, start, end and gene (exact match) into a new df:
chr start end score_file1 score_file2..file10 strand gene
1 20 30 50 20 40 - abc1
2 40 50 50 30 50 + cdf1
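Answer 0: (score: 0)
from functools import reduce

# df4 through df9 would be listed the same way
dfs = [df1[['chr','gene','start','end','score']],
       df2[['chr','gene','start','end','score']],
       df3[['chr','gene','start','end','score']],
       df10[['chr','gene','start','end','score']]]
# fold the list into one frame with repeated outer merges on the key columns
df_final = reduce(lambda left, right: pd.merge(left, right,
                  on=['gene','chr','start','end'], how='outer'), dfs)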
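One thing to watch with the reduce approach: every frame carries a column named score, so repeated merges fall back on the default _x/_y suffixes, and with more than a few frames those names can collide (recent pandas versions may refuse such a merge). A minimal sketch of one way around that, assuming the simplified columns from the question, is to rename score per file before merging. The names KEYS and merge_scores and the score_file1, score_file2 labels are made up for illustration, and the second input below is invented sample data.

```python
from functools import reduce
import io

import pandas as pd

# Key columns from the simplified question; strand is included so it
# appears once in the result instead of once per file.
KEYS = ["chr", "start", "end", "strand", "gene"]


def merge_scores(frames):
    """Outer-merge frames on KEYS, keeping one renamed score column per frame."""
    renamed = [
        # Keep only the keys plus score, which also trims memory use.
        df[KEYS + ["score"]].rename(columns={"score": f"score_file{i}"})
        for i, df in enumerate(frames, start=1)
    ]
    return reduce(
        lambda left, right: pd.merge(left, right, on=KEYS, how="outer"),
        renamed,
    )


# In-memory stand-ins for two of the files (second one is invented data).
file1 = io.StringIO(
    "chr start end score strand gene\n"
    "1 20 30 50 - abc1\n"
    "2 40 50 50 + cdf1\n"
)
file2 = io.StringIO(
    "chr start end score strand gene\n"
    "1 20 30 20 - abc1\n"
    "3 60 70 10 + xyz1\n"
)
frames = [pd.read_csv(f, sep=r"\s+") for f in (file1, file2)]
merged = merge_scores(frames)
print(merged)
```

Rows missing from a file show up with NaN in that file's score column, which is what how='outer' is expected to do. With real files, the frames list could be built from the glob result in the question instead of the StringIO objects.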