Python MemoryError when storing output in lists

Date: 2012-07-16 16:30:53

Tags: python parsing bioinformatics

I'm new to Python, so I apologize if this question is trivial.

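I am trying to write a simple script that zips together and extracts parts of two large data files (about 40 GB each) into a single result file with a slightly altered format. I first tried readlines(), but that reads the whole file into memory, and our instance has only 28 GB of RAM. Using the sizehint argument only parses part of the file.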
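A minimal sketch of the readlines(sizehint) behaviour described above, assuming Python 3 syntax. The helper name iter_batches is hypothetical, and the io.StringIO object only stands in for a large file. A single call to readlines(sizehint) returns just the first batch of lines, so it has to be called repeatedly until it returns an empty list:

```python
import io

def iter_batches(fileobj, sizehint):
    """Yield lists of lines; each batch holds roughly sizehint characters."""
    while True:
        batch = fileobj.readlines(sizehint)
        if not batch:  # an empty list means end of file
            break
        yield batch

# Stand-in for a large file: 8 short lines.
original_lines = ["r%d\n" % i for i in range(8)]
sample = io.StringIO("".join(original_lines))

# Only one batch is held in memory at a time while looping.
seen = []
for batch in iter_batches(sample, 10):
    seen.extend(batch)
```

Iterating over the file object directly (for line in fileobj) gives the same bounded memory use with one line at a time, which is simpler when no batching is needed.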

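I am now iterating over the files instead. The problem is that I store the output of the text parsing in three lists that grow quite large and exceed the available memory. I assumed this would simply spill over into swap, which would be fine, but instead the script just exits with a "MemoryError".

This works on small sample files but chokes on our real data.

The script: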

import sys

a = []
b = []
c = []

file1 = open(sys.argv[1],"r")
for line in file1:
    if '@' in line:
        a.append(line.lstrip('@').rstrip('\n'))
        b.append(file1.next().rstrip('\n'))
file1.close()

file2 = open(sys.argv[2],"r")
for line in file2:
    if '@' in line: 
        c.append(file2.next().rstrip('\n'))
file2.close()

file3 = open(sys.argv[3],"w")
for i in xrange(len(a)):
    file3.write("".join([">",a[i],'\n',b[i],":",c[i],"\n"]))

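The suggestions I found online recommend creating some kind of database to store the variables, but that shouldn't be necessary. Do you have any ideas on how I should handle this?

For completeness, this is what I am trying to do (taken from our sample test data):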

file1: 

@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

file2:

@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

file3 (output):

>Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG:TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT

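3 Answers:

Answer 0: (score: 1)

Can you write to the output file while you parse the input files, rather than parsing the files into arrays (a, b, c)?

Something like this pseudocode: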

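import sys

def next_record(f):
    # Return (header, sequence) for the next '@' record, or None at end of file
    while True:
        line = f.readline()
        if not line:  # readline() returns '' at end of file
            return None
        if line.startswith("@"):
            return line.lstrip("@").strip(), f.readline().strip()


# Open both input files for reading and the output file for writing
file1 = open(sys.argv[1])
file2 = open(sys.argv[2])
out = open(sys.argv[3], "w")

while True:
    # Repeat until either input file is exhausted
    rec1 = next_record(file1)
    rec2 = next_record(file2)
    if rec1 is None or rec2 is None:
        break
    out.write(">%s\n%s:%s\n" % (rec1[0], rec1[1], rec2[1]))

for f in (file1, file2, out):
    f.close()

That way you hold very little in memory (in theory, just the 3 file handles and the contents of the current lines).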

Answer 1: (score: 0)

I haven't tried this myself, but it seems the following should work:

file1 = open(sys.argv[1],"r")
file2 = open(sys.argv[2],"r")
file3 = open(sys.argv[3],"w")

for line1 in file1:
    if '@' in line1:  # line1.startswith('@') is probably better here
        a=line1.lstrip('@').rstrip('\n')
        b=file1.next().rstrip('\n')
        for line2 in file2:
            if '@' in line2:
                c=file2.next().rstrip('\n')
                break
        file3.write(">%s\n%s:%s\n"%(a,b,c))

file1.close()
file2.close()
file3.close()

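In this case only one line from each file is kept in memory at a time, so unless the files contain really long lines ;^), you should be fine.

Also, since you lstrip the '@' character, you may want to consider using if line.startswith('@') instead of if '@' in line.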
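To illustrate the difference between the two tests, here is a small sketch. The function names are hypothetical, and the second input line is made up for the example (it does not come from the sample data):

```python
def is_header_loose(line):
    # The question's test: true for any line containing '@' anywhere
    return '@' in line

def is_header_strict(line):
    # The suggested test: true only when the line begins with '@'
    return line.startswith('@')

header = "@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1\n"
# Hypothetical non-header line with '@' in the middle
other = "AAAA@AAAA\n"

print(is_header_loose(header), is_header_strict(header))  # True True
print(is_header_loose(other), is_header_strict(other))    # True False
```

With the loose test, a line like the second one would be treated as a record header and the line after it would be read as a sequence.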

Answer 2: (score: 0)

Here is my [second, more compact] attempt:

import sys
import itertools

def reader(fileobj, yield_at_line=False):
    for line in fileobj:
        if line.startswith('@'):
            at_line = line.lstrip('@').rstrip('\n')
            next_line = fileobj.next().rstrip('\n')
            yield (at_line, next_line) if yield_at_line else next_line

with open(sys.argv[1]) as file1, open(sys.argv[2]) as file2, open(sys.argv[3], "w") as file3:
    first = reader(file1, yield_at_line=True)
    second = reader(file2)
    for (a,b), c in itertools.izip(first, second):
        file3.write('>{}\n{}:{}\n'.format(a, b, c))

which gives

~/coding$ cat file1
@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

~/coding$ cat file2
@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

~/coding$ python simulwork.py file1 file2 file3
~/coding$ cat file3
>Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG:TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT
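The script above uses Python 2 idioms (fileobj.next() and itertools.izip). A rough Python 3 sketch of the same generator approach might look like the following. The merge function and the short in-memory records are assumptions added for illustration, not part of the original answer; the built-in zip is lazy in Python 3, so records are still processed one pair at a time:

```python
import io

def reader(fileobj, yield_at_line=False):
    """Yield the line after each '@' header, optionally with the header."""
    for line in fileobj:
        if line.startswith('@'):
            at_line = line.lstrip('@').rstrip('\n')
            # next(fileobj, '') replaces fileobj.next(); '' if the file ends early
            next_line = next(fileobj, '').rstrip('\n')
            yield (at_line, next_line) if yield_at_line else next_line

def merge(file1, file2, out):
    """Stream paired records from file1 and file2 into out."""
    first = reader(file1, yield_at_line=True)
    second = reader(file2)
    for (a, b), c in zip(first, second):
        out.write('>{}\n{}:{}\n'.format(a, b, c))

# Tiny made-up records standing in for the two large input files.
f1 = io.StringIO("@r1/1\nACGT\n+\nAAAA\n@r2/1\nGGCC\n+\nAAAA\n")
f2 = io.StringIO("@r1/1\nTTTT\n+\nAAAA\n@r2/1\nCCAA\n+\nAAAA\n")
out = io.StringIO()
merge(f1, f2, out)
result = out.getvalue()
```

For real files, the three io.StringIO objects would be replaced by open(sys.argv[1]), open(sys.argv[2]) and open(sys.argv[3], "w") inside a with statement, as in the script above.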