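I'm new to Python, so apologies if this example is trivial.
I'm trying to write a simple script that zips together and extracts parts of two large data files (roughly 40 GB each) into one result file with a slightly different format. I first tried readlines(), but that reads the whole file into memory, and our instance has only 28 GB of RAM. Using the sizehint argument only parses part of the file.
So now I'm iterating over the files. The problem is that I store the output of the text parsing in three lists that grow quite large and eclipse the available memory. I assumed this would simply spill over into swap, which would be fine, but instead the script just exits with a MemoryError.
This works on small sample files but chokes on our actual data.
The script: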
import sys

a = []
b = []
c = []

file1 = open(sys.argv[1],"r")
for line in file1:
    if '@' in line:
        a.append(line.lstrip('@').rstrip('\n'))
        b.append(file1.next().rstrip('\n'))
file1.close()

file2 = open(sys.argv[2],"r")
for line in file2:
    if '@' in line:
        c.append(file2.next().rstrip('\n'))
file2.close()

file3 = open(sys.argv[3],"w")
for i in xrange(len(a)):
    file3.write("".join([">",a[i],'\n',b[i],":",c[i],"\n"]))
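Suggestions I found online involve creating some kind of database to store the variables, but that shouldn't be necessary. Do you have any ideas on how I should handle this?
For completeness, this is what I'm trying to do (from our sample test data):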
file1:
@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
file2:
@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
file3 (output):
>Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG:TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT
Answer 0 (score: 1)
Instead of parsing the files into arrays (a, b and c), can you write to the output file while you parse?
Something like this rough sketch:
import sys

def next_record(fh):
    # Return (header, sequence) for the next '@' record, or None at end of file.
    for line in fh:
        if line.startswith("@"):
            return line[1:].strip(), next(fh, "").strip()
    return None

# Open both inputs and the output; only the current lines are held in memory.
with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2, open(sys.argv[3], "w") as out:
    while True:
        rec1, rec2 = next_record(f1), next_record(f2)
        if rec1 is None or rec2 is None:
            break  # stop as soon as either input is exhausted
        out.write(">%s\n%s:%s\n" % (rec1[0], rec1[1], rec2[1]))
That way you hold very little in memory (in theory, just the open file handles and the contents of the current lines).
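A variation on the same streaming idea, as a minimal sketch only. It assumes Python 3 and assumes every record occupies exactly four lines (header, sequence, '+', quality), as in the sample data in the question. The names records and merge_fastq are made up for illustration, and the in-memory io.StringIO objects stand in for real open files.

```python
import io
from itertools import islice

def records(handle):
    """Yield (header, sequence) from 4-line records, one record at a time."""
    while True:
        chunk = list(islice(handle, 4))  # read at most four lines
        if len(chunk) < 4:
            return  # end of input (or a truncated final record)
        header, sequence = chunk[0], chunk[1]
        # Drop the leading '@' and the trailing newline from the header.
        yield header.rstrip("\n")[1:], sequence.rstrip("\n")

def merge_fastq(in1, in2, out):
    """Write one merged entry per record pair and return how many were written."""
    count = 0
    # zip is lazy in Python 3, so only one record per input is held at a time.
    for (name, seq1), (_, seq2) in zip(records(in1), records(in2)):
        out.write(">%s\n%s:%s\n" % (name, seq1, seq2))
        count += 1
    return count

# Small in-memory demonstration; real use would pass open file objects.
f1 = io.StringIO("@r1\nACGT\n+\nAAAA\n@r2\nGGCC\n+\n@AAA\n")
f2 = io.StringIO("@r1\nTTTT\n+\nAAAA\n@r2\nCCAA\n+\nAAAA\n")
result = io.StringIO()
n = merge_fastq(f1, f2, result)
print(n)
print(result.getvalue(), end="")
```

Grouping by position means a quality line that happens to start with '@' (as in the second record of f1 above) is never mistaken for a header, but it only works if the four-line assumption really holds for your files.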
Answer 1 (score: 0)
I haven't tried this myself, but something like the following should work:
file1 = open(sys.argv[1],"r")
file2 = open(sys.argv[2],"r")
file3 = open(sys.argv[3],"w")

for line1 in file1:
    if '@' in line1:  # line1.startswith('@') is probably better here
        a=line1.lstrip('@').rstrip('\n')
        b=file1.next().rstrip('\n')
        for line2 in file2:
            if '@' in line2:
                c=file2.next().rstrip('\n')
                break
        file3.write(">%s\n%s:%s\n"%(a,b,c))

file1.close()
file2.close()
file3.close()
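In this case only one line from each file is held in memory at a time, so unless the files have really long lines ;^) you should be fine.
Also, since you lstrip the '@' character, you might want to consider using if line.startswith('@') instead of if '@' in line.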
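To illustrate the difference between the two tests, here is a small sketch. The two sample lines and the helper names is_header_in and is_header_startswith are hypothetical, chosen only for this demonstration.

```python
header = "@Read.example/1\n"   # hypothetical header line
quality = "AAAA@AAAA\n"        # hypothetical quality line that contains '@'

def is_header_in(line):
    """Substring test: true if '@' appears anywhere in the line."""
    return "@" in line

def is_header_startswith(line):
    """Prefix test: true only if the line begins with '@'."""
    return line.startswith("@")

print(is_header_in(header), is_header_startswith(header))    # True True
print(is_header_in(quality), is_header_startswith(quality))  # True False
```

Note that a line which happens to begin with '@' would still pass the startswith test, so this narrows the problem rather than removing it.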
Answer 2 (score: 0)
Here is my [second, more compact] attempt:
import sys
import itertools

def reader(fileobj, yield_at_line=False):
    for line in fileobj:
        if line.startswith('@'):
            at_line = line.lstrip('@').rstrip('\n')
            next_line = fileobj.next().rstrip('\n')
            yield (at_line, next_line) if yield_at_line else next_line

with open(sys.argv[1]) as file1, open(sys.argv[2]) as file2, open(sys.argv[3], "w") as file3:
    first = reader(file1, yield_at_line=True)
    second = reader(file2)
    for (a,b), c in itertools.izip(first, second):
        file3.write('>{}\n{}:{}\n'.format(a, b, c))
which gives:
~/coding$ cat file1
@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
~/coding$ cat file2
@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
~/coding$ python simulwork.py file1 file2 file3
~/coding$ cat file3
>Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG:TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT
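The code in this thread targets Python 2 (fileobj.next(), itertools.izip, xrange). For anyone on Python 3, the same generator approach might look roughly like the sketch below. It is an untested-on-real-data port, not the original author's code: the merge wrapper, the shortened sample records, and the use of io.StringIO in place of real files are assumptions made for the demonstration.

```python
import io

def reader(fileobj, yield_at_line=False):
    """Yield the line after each '@' line, optionally paired with the '@' line."""
    for line in fileobj:
        if line.startswith("@"):
            at_line = line[1:].rstrip("\n")
            # A default avoids StopIteration escaping from inside the
            # generator, which Python 3.7+ turns into a RuntimeError.
            next_line = next(fileobj, "").rstrip("\n")
            yield (at_line, next_line) if yield_at_line else next_line

def merge(file1, file2, file3):
    """Stream both inputs in step and write the merged entries to file3."""
    first = reader(file1, yield_at_line=True)
    second = reader(file2)
    # The built-in zip is lazy in Python 3, so it replaces itertools.izip.
    for (a, b), c in zip(first, second):
        file3.write(">{}\n{}:{}\n".format(a, b, c))

# In-memory stand-ins for the three files; real use would pass open() handles.
file1 = io.StringIO("@Read.x/1\nACGT\n+\nAAAA\n")
file2 = io.StringIO("@Read.x/1\nTGCA\n+\nAAAA\n")
file3 = io.StringIO()
merge(file1, file2, file3)
print(file3.getvalue(), end="")
```

Do not swap izip for zip in the Python 2 version: there zip builds the full list of pairs first, which brings the memory problem straight back.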