I am trying to process and join about 200 text files (*.cov) with Python, each about 1 GB in size (roughly 5 million lines). The files look like this:
001_VA.cov
gene55970: 188/300 Percentage:62.6667 Depth:1.2
gene2664777: 0/456 Percentage:0 Depth:nan
gene1537538: 407/606 Percentage:67 Depth:2
gene3911524: 0/315 Percentage:0 Depth:nan
gene55971: 185/264 Percentage:70 Depth:1.8
gene1537539: 0/621 Percentage:0 Depth:nan
gene1268688: 1687/3573 Percentage:47 Depth:2
gene3911521: 0/315 Percentage:0 Depth:nan
gene1268689: 0/744 Percentage:0 Depth:nan
gene3911522: 0/165 Percentage:0 Depth:nan
015_MA.cov
gene55970: 0/300 Percentage:0 Depth:nan
gene2664777: 0/456 Percentage:0 Depth:nan
gene1537538: 0/606 Percentage:0 Depth:nan
gene3911524: 0/315 Percentage:0 Depth:nan
gene55971: 113/264 Percentage:43 Depth:1
gene1537539: 0/621 Percentage:0 Depth:nan
gene1268688: 2298/3573 Percentage:64 Depth:2.4
gene3911521: 0/315 Percentage:0 Depth:nan
gene1268689: 0/744 Percentage:0 Depth:nan
gene3911522: 0/165 Percentage:0 Depth:nan
079_MC.cov
gene55970: 0/300 Percentage:0 Depth:nan
gene2664777: 0/456 Percentage:0 Depth:nan
gene1537538: 0/606 Percentage:0 Depth:nan
gene3911524: 0/315 Percentage:0 Depth:nan
gene55971: 0/264 Percentage:0 Depth:nan
gene1537539: 0/621 Percentage:0 Depth:nan
gene1268688: 1372/3573 Percentage:38 Depth:1.3
gene3911521: 0/315 Percentage:0 Depth:nan
gene1268689: 0/744 Percentage:0 Depth:nan
gene3911522: 0/165 Percentage:0 Depth:nan
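To make the line layout concrete, here is a minimal sketch of parsing one such line. The function name `parse_cov_line` and the regular expression are my own assumptions based on the samples above (gene ID, covered/length, Percentage, Depth); the whitespace is matched loosely because the exact separators (tabs vs. spaces) are not visible in the pasted samples.

```python
import re

# Assumed layout, taken from the samples: "gene55970: 188/300 Percentage:62.6667 Depth:1.2"
COV_LINE = re.compile(r"^(gene\d+):\s*(\d+)/(\d+)\s+Percentage:(\S+)\s+Depth:(\S+)")


def parse_cov_line(line):
    """Return (gene, length, depth) for a gene line, or None for any other line.

    A depth of 'nan' is reported as 0.0, matching the desired output matrix.
    """
    match = COV_LINE.match(line)
    if match is None:
        return None  # e.g. the file-name line at the top of each sample
    gene, _covered, length, _percentage, depth = match.groups()
    return gene, int(length), 0.0 if depth == "nan" else float(depth)


print(parse_cov_line("gene55970: 188/300 Percentage:62.6667 Depth:1.2"))
print(parse_cov_line("gene2664777: 0/456 Percentage:0 Depth:nan"))
```

Only the gene ID, the length (the number after the slash) and the depth are needed for the merged matrix, so the other fields are ignored.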
The output should be a tab-separated matrix:
merged.tsv
gene length 001_VA.cov 015_MA.cov 079_MC.cov
gene1268686 654 0 0 0
gene1268687 1401 0 0 0
gene1268688 3573 2 2.4 1.3
gene1268689 744 0 0 0
gene1537538 606 2 0 0
gene1537539 621 0 0 0
gene1859184 264 0 0 0
gene1859185 759 1 0 2.7
gene1859186 138 3.8 0 1
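With a smaller subset of these files, my script handles them fine. However, when I process all of them together, the task becomes practically impossible. I need help with a script that performs this task efficiently, preferably in multithreaded mode.
Here is my code: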
#!/usr/bin/env python
import sys, os, glob, re, csv
os.chdir("/path/to/files")
files = sorted(glob.glob("*.cov"))
headers = ["gene", "length"] + files

# Map (gene, length) -> one depth value per file. Each row is pre-filled with
# '0' so a gene that is missing from some file still lines up with its column.
my_dict = {}
for col, file in enumerate(files):
    with open(file, 'r') as f:
        # Iterate over the file lazily; readlines() would load ~1 GB at once.
        for line in f:
            if not line.startswith('gene'):
                continue
            fields = re.split('\t|:|/', line.strip())
            gene = fields[0]
            length = fields[2]
            # Report a depth of 'nan' as 0.
            value = '0' if fields[6] == 'nan' else fields[6]
            key = (gene, length)
            if key not in my_dict:
                my_dict[key] = ['0'] * len(files)
            my_dict[key][col] = value

# The with-statement closes the file, so no explicit close() is needed.
with open("merged.tsv", "w") as outfile:
    outfile.write('\t'.join(headers) + '\n')
    for key in sorted(my_dict):
        outfile.write('\t'.join(key) + '\t' + '\t'.join(my_dict[key]) + '\n')
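One possible way to avoid keeping everything in memory is to read all files in lockstep and write one output row at a time. The sketch below is only that, a sketch: it assumes every .cov file lists the same genes in the same order (which the samples suggest but do not guarantee), and the names `gene_records` and `merge_cov_lockstep` are hypothetical. It is single-threaded, since the work is dominated by reading and writing files.

```python
import os
import re
from contextlib import ExitStack

# Assumed line layout: "gene55970: 188/300 Percentage:62.6667 Depth:1.2"
LINE_RE = re.compile(r"^(gene\d+):\s*\d+/(\d+)\s+Percentage:\S+\s+Depth:(\S+)")


def gene_records(handle):
    """Yield (gene, length, depth) per gene line, reporting 'nan' as '0'."""
    for line in handle:
        match = LINE_RE.match(line)
        if match:
            gene, length, depth = match.groups()
            yield gene, length, "0" if depth == "nan" else depth


def merge_cov_lockstep(paths, out_path):
    """Merge .cov files column-wise while holding only one row in memory.

    Assumption: all files contain the same genes in the same order.
    zip() stops at the shortest file, so files of unequal length are truncated.
    """
    with ExitStack() as stack:
        readers = [gene_records(stack.enter_context(open(p))) for p in paths]
        out = stack.enter_context(open(out_path, "w"))
        names = [os.path.basename(p) for p in paths]
        out.write("\t".join(["gene", "length"] + names) + "\n")
        for row in zip(*readers):
            gene, length = row[0][0], row[0][1]
            # Fail loudly if the same-order assumption does not hold.
            if any(rec[0] != gene for rec in row):
                raise ValueError("gene order differs between files at " + gene)
            out.write("\t".join([gene, length] + [rec[2] for rec in row]) + "\n")
```

It could be called as `merge_cov_lockstep(sorted(glob.glob("*.cov")), "merged.tsv")`. Rows come out in the order of the input files rather than sorted by gene name, and about 200 files are open at the same time, which has to fit within the operating system's open-file limit.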