In File1:
chrom start end strand gene_id gene_name
1 4763414 4764404 - ENSMUSG00000033845 Mrpl15
1 4764597 4767606 - ENSMUSG00000033845 Mrpl15
1 4764597 4766491 - ENSMUSG00000033845 Mrpl15
1 4766882 4767606 - ENSMUSG00000033845 Mrpl15
1 4767729 4772649 - ENSMUSG00000033845 Mrpl15
1 4767729 4768829 - ENSMUSG00000033845 Mrpl15
1 4767729 4775654 - ENSMUSG00000033845 Mrpl15
1 4772382 4772649 - ENSMUSG00000033845 Mrpl15
1 4772814 4774032 - ENSMUSG00000033845 Mrpl15
1 4772814 4774159 - ENSMUSG00000033845 Mrpl15
1 4772814 4775654 - ENSMUSG00000033845 Mrpl15
1 4772814 4774032 + ENSMUSG00000033845 Mrpl15
1 4774186 4775654 - ENSMUSG00000033845 Mrpl15
1 4774186 4775654 + ENSMUSG00000033845 Mrpl15
1 4774186 4775699 - ENSMUSG00000033845 Mrpl15
1 4775960 4798536 + ENSMUSG00000025903 Lypla1
1 4831213 4857551 + ENSMUSG00000025903 Lypla1
1 4831213 4857551 + ENSMUSG00000033813 Tcea1
Expected output:
chrom start end strand gene_id gene_name
1 4763414 4764404 - ENSMUSG00000033845 Mrpl15
1 4764597 4767606 - ENSMUSG00000033845 Mrpl15
1 4764597 4766491 - ENSMUSG00000033845 Mrpl15
1 4766882 4767606 - ENSMUSG00000033845 Mrpl15
1 4767729 4772649 - ENSMUSG00000033845 Mrpl15
1 4767729 4768829 - ENSMUSG00000033845 Mrpl15
1 4767729 4775654 - ENSMUSG00000033845 Mrpl15
1 4772382 4772649 - ENSMUSG00000033845 Mrpl15
1 4772814 4774032 - ENSMUSG00000033845 Mrpl15
1 4772814 4774159 - ENSMUSG00000033845 Mrpl15
1 4772814 4775654 - ENSMUSG00000033845 Mrpl15
1 4772814 4774032 + ENSMUSG00000033845 Mrpl15
1 4774186 4775654 - ENSMUSG00000033845 Mrpl15
1 4774186 4775654 + ENSMUSG00000033845 Mrpl15
1 4774186 4775699 - ENSMUSG00000033845 Mrpl15
1 4775960 4798536 + ENSMUSG00000025903 Lypla1
1 4831213 4857551 + ENSMUSG00000025903,ENSMUSG00000033813 Lypla1,Tcea1
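In this case, the last row has two values merged into a single column because both belong to "1 4831213 4857551 +"; sometimes there may be more than two. What is the ideal way to do this?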
file2["chrom"].update(dict(zip(["start", "end", "strand"]
Is this the right approach?
Answer 0 (score: 1)
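You can use a defaultdict to group the rows that share the same combination. The dict key can be a string built by joining the columns that define the "multi-value" criterion: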
from collections import defaultdict

data = """chrom start end strand gene_id gene_name
1 4774186 4775699 - ENSMUSG00000033845 Mrpl15
1 4775960 4798536 + ENSMUSG00000025903 Lypla1
1 4831213 4857551 + ENSMUSG00000025903 Lypla1
1 4831213 4857551 + ENSMUSG00000033813 Tcea1"""

result = defaultdict(list)
headers = ""
for i, line in enumerate(data.splitlines()):
    if i == 0:
        # The first line holds the column names
        headers = line.split()
    else:
        d = dict(zip(headers, line.split()))
        # Rows with the same chrom/start/end/strand share one key
        key = '%(chrom)s_%(start)s_%(end)s_%(strand)s' % d
        result[key].append(d)

for val in result.values():
    print(val)
This prints:
[{'chrom': '1', 'start': '4774186', 'end': '4775699', 'strand': '-', 'gene_id': 'ENSMUSG00000033845', 'gene_name': 'Mrpl15'}]
[{'chrom': '1', 'start': '4775960', 'end': '4798536', 'strand': '+', 'gene_id': 'ENSMUSG00000025903', 'gene_name': 'Lypla1'}]
[{'chrom': '1', 'start': '4831213', 'end': '4857551', 'strand': '+', 'gene_id': 'ENSMUSG00000025903', 'gene_name': 'Lypla1'}, {'chrom': '1', 'start': '4831213', 'end': '4857551', 'strand': '+', 'gene_id': 'ENSMUSG00000033813', 'gene_name': 'Tcea1'}]
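As a variation on the grouping step above, the key can be a tuple rather than a formatted string, which avoids having to pick a separator character. This is only a minimal sketch and not part of the original answer; `group_rows` and `KEY_COLUMNS` are hypothetical names introduced here for illustration.

```python
from collections import defaultdict

# Columns that identify a row; rows sharing all four values are grouped together.
KEY_COLUMNS = ("chrom", "start", "end", "strand")


def group_rows(text):
    """Group whitespace-separated rows by the values in KEY_COLUMNS."""
    lines = text.splitlines()
    headers = lines[0].split()
    groups = defaultdict(list)
    for line in lines[1:]:
        if not line.strip():
            continue  # skip blank lines
        row = dict(zip(headers, line.split()))
        key = tuple(row[c] for c in KEY_COLUMNS)
        groups[key].append(row)
    return headers, groups


sample = """chrom start end strand gene_id gene_name
1 4775960 4798536 + ENSMUSG00000025903 Lypla1
1 4831213 4857551 + ENSMUSG00000025903 Lypla1
1 4831213 4857551 + ENSMUSG00000033813 Tcea1"""

headers, groups = group_rows(sample)
for key, rows in groups.items():
    print(key, [r["gene_name"] for r in rows])
```

Because plain dicts preserve insertion order in Python 3.7 and later, the groups come out in the order in which each key first appears in the input.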
To write the result to a CSV file, use join to combine the columns that need to be merged:
import csv

# newline='' stops the csv module from writing extra blank lines on Windows
with open('write.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    writer.writerow(headers)
    for vals in result.values():
        _finalRow = []
        for h in headers:
            if h not in ['gene_id', 'gene_name']:
                _finalRow.append(vals[0][h])  # regular columns
            else:
                _finalRow.append(','.join([v[h] for v in vals]))  # merge columns
        writer.writerow(_finalRow)  # one output row per group
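To get the space-separated layout shown in the question rather than a semicolon-delimited file, the grouping and joining can be combined in one function. The following is a rough sketch under a few assumptions: the input is whitespace-separated, the column names match the sample, and identical (gene_id, gene_name) pairs should appear only once. `merge_duplicates`, `KEY_COLUMNS` and `MERGE_COLUMNS` are hypothetical names, not something from the original answer.

```python
from collections import defaultdict

KEY_COLUMNS = ("chrom", "start", "end", "strand")   # identify a row
MERGE_COLUMNS = ("gene_id", "gene_name")            # comma-joined per group


def merge_duplicates(text):
    """Return output lines with one row per unique KEY_COLUMNS combination."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    headers = lines[0].split()
    groups = defaultdict(list)  # key -> list of (gene_id, gene_name) pairs
    for line in lines[1:]:
        row = dict(zip(headers, line.split()))
        key = tuple(row[c] for c in KEY_COLUMNS)
        pair = tuple(row[c] for c in MERGE_COLUMNS)
        if pair not in groups[key]:  # assumption: drop exact repeats
            groups[key].append(pair)
    out = [" ".join(KEY_COLUMNS + MERGE_COLUMNS)]
    for key, pairs in groups.items():
        # zip(*pairs) turns [(id1, name1), (id2, name2)] into (id1, id2), (name1, name2)
        merged = tuple(",".join(values) for values in zip(*pairs))
        out.append(" ".join(key + merged))
    return out


sample = """chrom start end strand gene_id gene_name
1 4774186 4775699 - ENSMUSG00000033845 Mrpl15
1 4775960 4798536 + ENSMUSG00000025903 Lypla1
1 4831213 4857551 + ENSMUSG00000025903 Lypla1
1 4831213 4857551 + ENSMUSG00000033813 Tcea1"""

for line in merge_duplicates(sample):
    print(line)
```

Keeping gene_id and gene_name together as a pair while grouping means the two joined columns stay aligned, so the n-th ID always corresponds to the n-th name. Because strand is part of the key, rows that differ only in strand remain separate, as in the expected output.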