Joining large files with Python

Date: 2017-04-20 06:26:35

Tags: python-2.7

I am trying to process and join about 200 text files (*.cov) with Python. Each file is about 1 GB and contains roughly 5 million lines. The files look like this:

001_VA.cov
gene55970: 188/300 Percentage:62.6667 Depth:1.2
gene2664777: 0/456 Percentage:0 Depth:nan
gene1537538: 407/606 Percentage:67 Depth:2
gene3911524: 0/315 Percentage:0 Depth:nan
gene55971: 185/264 Percentage:70 Depth:1.8
gene1537539: 0/621 Percentage:0 Depth:nan
gene1268688: 1687/3573 Percentage:47 Depth:2
gene3911521: 0/315 Percentage:0 Depth:nan
gene1268689: 0/744 Percentage:0 Depth:nan
gene3911522: 0/165 Percentage:0 Depth:nan

015_MA.cov
gene55970: 0/300 Percentage:0 Depth:nan
gene2664777: 0/456 Percentage:0 Depth:nan
gene1537538: 0/606 Percentage:0 Depth:nan
gene3911524: 0/315 Percentage:0 Depth:nan
gene55971: 113/264 Percentage:43 Depth:1
gene1537539: 0/621 Percentage:0 Depth:nan
gene1268688: 2298/3573 Percentage:64 Depth:2.4
gene3911521: 0/315 Percentage:0 Depth:nan
gene1268689: 0/744 Percentage:0 Depth:nan
gene3911522: 0/165 Percentage:0 Depth:nan

079_MC.cov
gene55970: 0/300 Percentage:0 Depth:nan
gene2664777: 0/456 Percentage:0 Depth:nan
gene1537538: 0/606 Percentage:0 Depth:nan
gene3911524: 0/315 Percentage:0 Depth:nan
gene55971: 0/264 Percentage:0 Depth:nan
gene1537539: 0/621 Percentage:0 Depth:nan
gene1268688: 1372/3573 Percentage:38 Depth:1.3
gene3911521: 0/315 Percentage:0 Depth:nan
gene1268689: 0/744 Percentage:0 Depth:nan
gene3911522: 0/165 Percentage:0 Depth:nan
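Each record in the samples has the shape `gene: covered/length Percentage:p Depth:d`. As a rough sketch of how one such record could be pulled apart, the snippet below uses a regular expression that tolerates either spaces or tabs between the fields. The name `parse_cov_line` is made up for illustration, and treating `nan` as `0` is taken from the desired output, not from any documented format.

```python
import re

# Hypothetical pattern for one record; it accepts spaces or tabs between fields.
RECORD = re.compile(r"^(gene\d+):\s*\d+/(\d+)\s+Percentage:\S+\s+Depth:(\S+)")


def parse_cov_line(line):
    """Return (gene, length, depth) as strings, or None for non-record lines."""
    match = RECORD.match(line)
    if match is None:
        return None
    gene, length, depth = match.groups()
    # The desired matrix shows 0 where the input has Depth:nan.
    if depth == "nan":
        depth = "0"
    return gene, length, depth
```

For example, `parse_cov_line("gene55970: 188/300 Percentage:62.6667 Depth:1.2")` gives `("gene55970", "300", "1.2")`.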

The output should be a tab-separated matrix:

merged.tsv
gene         length  001_VA.cov  015_MA.cov  079_MC.cov
gene1268686  654     0           0           0
gene1268687  1401    0           0           0
gene1268688  3573    2           2.4         1.3
gene1268689  744     0           0           0
gene1537538  606     2           0           0
gene1537539  621     0           0           0
gene1859184  264     0           0           0
gene1859185  759     1           0           2.7
gene1859186  138     3.8         0           1

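My script can process a smaller subset of these files. However, when I try to process all of them together, the task becomes practically impossible. I need help with a script that performs this task efficiently, ideally in multithreaded mode.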

Here is my code:

#!/usr/bin/env python
import glob
import os
import re

os.chdir("/path/to/files")

# Maps "gene\tlength" to the list of depth values, one entry per input file.
my_dict = {}
headers = ["gene", "length"]

for filename in sorted(glob.glob("*.cov")):
    headers.append(filename)
    with open(filename, "r") as f:
        # Iterate over the file object instead of calling readlines(),
        # so only one line of the file is held in memory at a time.
        for line in f:
            # Only lines starting with 'gene' carry data.
            if not line.startswith("gene"):
                continue
            # The indices below imply records of the form
            # "gene55970: 188/300<TAB>Percentage:62.6667<TAB>Depth:1.2".
            fields = re.split(r"\t|:|/", line.strip())
            gene = fields[0]
            length = fields[2]
            key = "{}\t{}".format(gene, length)
            # Report a missing depth ('nan') as 0.
            value = "0" if fields[6] == "nan" else fields[6]
            my_dict.setdefault(key, []).append(value)

with open("merged.tsv", "w") as outfile:
    outfile.write("\t".join(headers) + "\n")
    for key in sorted(my_dict):
        outfile.write(key + "\t" + "\t".join(my_dict[key]) + "\n")
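The dictionary above grows to hold every depth value from every file, which is the likely reason the full run does not finish. One possible alternative, sketched here under a strong assumption, is to read all files in lockstep and write each matrix row as soon as it is complete. The assumption is that every file lists the same genes in the same order, as the three samples do; this is not confirmed for the full data set. The names `gene_records` and `merge_in_lockstep` and the record pattern are illustrative only.

```python
import os
import re

try:
    from itertools import izip as lazy_zip  # Python 2.7: zip() would build a full list
except ImportError:
    lazy_zip = zip  # Python 3: zip() is already lazy

# Hypothetical record pattern; accepts spaces or tabs between fields.
RECORD = re.compile(r"^(gene\d+):\s*\d+/(\d+)\s+Percentage:\S+\s+Depth:(\S+)")


def gene_records(handle):
    """Yield (gene, length, depth) per record line, skipping any other line."""
    for line in handle:
        match = RECORD.match(line)
        if match:
            gene, length, depth = match.groups()
            yield gene, length, "0" if depth == "nan" else depth


def merge_in_lockstep(paths, out_path):
    """Write one row per gene while keeping only one record per file in memory."""
    handles = [open(path, "r") for path in paths]
    try:
        with open(out_path, "w") as out:
            names = [os.path.basename(path) for path in paths]
            out.write("\t".join(["gene", "length"] + names) + "\n")
            for rows in lazy_zip(*[gene_records(h) for h in handles]):
                gene, length = rows[0][0], rows[0][1]
                # Assumption check: stop if the files are not aligned gene by gene.
                if any(row[0] != gene for row in rows):
                    raise ValueError("gene order differs at " + gene)
                out.write("\t".join([gene, length] + [row[2] for row in rows]) + "\n")
    finally:
        for handle in handles:
            handle.close()
```

It could be called as `merge_in_lockstep(sorted(glob.glob("*.cov")), "merged.tsv")` after `import glob`. Rows come out in file order, not sorted by gene name, and the loop stops at the end of the shortest file. About 200 files stay open at once, which may exceed the open-file limit on some systems.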

0 answers:

No answers