I have a tab-delimited file, file1:
marker1 transcript0 scaff1 1 24
marker2 transcript1 scaff2 1 53
marker3 transcript1 scaff2 1 53
marker4 transcript2 scaff3 1 89
marker5 transcript2 scaff3 1 89
marker6 transcript2 scaff3 1 89
and file2:
contig1 transcript1 scaff2 1 53
contig2 transcript1 scaff2 1 53
contig3 transcript1 scaff2 1 53
contig4 transcript2 scaff3 1 89
The output file I want is:
transcript1 marker2 contig1 scaff2 1 53
transcript1 marker3 contig2 scaff2 1 53
transcript1 0 contig3 scaff2 1 53
transcript2 marker4 contig4 scaff3 1 89
transcript2 marker5 0 scaff3 1 89
transcript2 marker6 0 scaff3 1 89
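Basically, I need to join the two files wherever they share a transcript. The two files have different lengths. I have tried both a dictionary and the command-line join, but neither gave a good result. Can you give me some suggestions or ideas on how to do this in Python? I tried join: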
join -1 2 -2 2 file1 file2
and this code:
f1=open('file1','r')
f2=open('file2','r')
output = open('common','w')
dictA= dict()
for line1 in f1:
    listA = line1.rstrip('\n').split('\t')
    dictA[listA[1]] = listA
for line1 in f2:
    new_list=line1.rstrip('\n').split('\t')
    query=new_list[0]
    subject=new_list[1]
    scaff=new_list[2]
    chrom=new_list[3]
    cm=new_list[4]
    if subject in dictA:
        listA = dictA[subject]
        output.write(subject+'\t'+query+'\t'+str(listA[0])+'\t'+str(listA[1])+'\t'+str(listA[2])+'\t'+str(listA[3])+'\t'+chrom+'\t'+cm+'\t'+scaff+'\n')
output.close()
Answer 0 (score: 1)
How about this (Python 3):
from collections import defaultdict
from itertools import zip_longest

with open('file1', 'r') as f1, open('file2', 'r') as f2, \
        open('common', 'w') as fout:
    # Remaining columns after the transcript, keyed by transcript.
    remainder = {}

    # Collect every marker listed for each transcript in file1.
    markers = defaultdict(list)
    for line in f1:
        fields = line.split()
        markers[fields[1]].append(fields[0])
        remainder[fields[1]] = fields[2:]

    # Collect every contig listed for each transcript in file2.
    contigs = defaultdict(list)
    for line in f2:
        fields = line.split()
        contigs[fields[1]].append(fields[0])
        remainder[fields[1]] = fields[2:]

    # Union of the transcripts from both files, in sorted order.
    transcripts = sorted(set(markers.keys()) | set(contigs.keys()))
    for transcript in transcripts:
        rest = remainder[transcript]
        # Pair markers with contigs, padding the shorter list with '0'.
        zipped = zip_longest(markers[transcript], contigs[transcript],
                             fillvalue='0')
        for marker, contig in zipped:
            print(transcript, marker, contig, *rest, sep='\t', file=fout)
Output:
transcript0 marker1 0 scaff1 1 24
transcript1 marker2 contig1 scaff2 1 53
transcript1 marker3 contig2 scaff2 1 53
transcript1 0 contig3 scaff2 1 53
transcript2 marker4 contig4 scaff3 1 89
transcript2 marker5 0 scaff3 1 89
transcript2 marker6 0 scaff3 1 89
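Note that the output above includes a transcript0 row, which the desired output in the question leaves out, because the code takes the union of the transcripts from both files. If only transcripts present in both files should be kept, one option is to take the intersection instead. The following is a minimal sketch of that variant, not a definitive implementation: the helper names `group_by_transcript` and `merge_common` are hypothetical, and the sample lines are copied from the question (assuming tab-separated fields).

```python
from collections import defaultdict
from itertools import zip_longest


def group_by_transcript(lines):
    """Return ({transcript: [ids]}, {transcript: remaining fields})."""
    ids = defaultdict(list)
    rest = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        ids[fields[1]].append(fields[0])
        rest[fields[1]] = fields[2:]
    return ids, rest


def merge_common(lines1, lines2, fill='0'):
    """Pair markers and contigs for transcripts found in BOTH inputs."""
    markers, rest = group_by_transcript(lines1)
    contigs, _ = group_by_transcript(lines2)
    rows = []
    # Intersection of the keys, so transcripts unique to one file are dropped.
    for transcript in sorted(markers.keys() & contigs.keys()):
        pairs = zip_longest(markers[transcript], contigs[transcript],
                            fillvalue=fill)
        for marker, contig in pairs:
            rows.append([transcript, marker, contig, *rest[transcript]])
    return rows


# Sample data from the question; real code would pass open file objects.
file1_lines = [
    "marker1\ttranscript0\tscaff1\t1\t24",
    "marker2\ttranscript1\tscaff2\t1\t53",
    "marker3\ttranscript1\tscaff2\t1\t53",
    "marker4\ttranscript2\tscaff3\t1\t89",
    "marker5\ttranscript2\tscaff3\t1\t89",
    "marker6\ttranscript2\tscaff3\t1\t89",
]
file2_lines = [
    "contig1\ttranscript1\tscaff2\t1\t53",
    "contig2\ttranscript1\tscaff2\t1\t53",
    "contig3\ttranscript1\tscaff2\t1\t53",
    "contig4\ttranscript2\tscaff3\t1\t89",
]

for row in merge_common(file1_lines, file2_lines):
    print(*row, sep='\t')
```

With the sample data this prints the six rows for transcript1 and transcript2 and no transcript0 row. The sort is plain string ordering, so a name like transcript10 would come before transcript2; a different sort key may be needed for real data.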