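I have a set of millions of small numbers stored in files.
I wrote a Python script that reads numbers line by line from a tab-delimited text file, computes remainders, and appends the results to output files. For some reason it consumes a lot of RAM (20 Gb to parse 200,000 numbers on Ubuntu). It also freezes the system because of the frequent write operations.
What is the right way to tune this script?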
import os
import re

my_path = '/media/me/mSata/res/'
# output_file.open() before the first loop didn't help
for file_id in range(10, 11):  # 10,201
    filename = my_path + "in" + str(file_id) + ".txt"
    fstr0 = "" + my_path + "out" + str(file_id) + "_0.log"
    fstr1 = "" + my_path + "res" + str(file_id) + "_1.log"
    with open(filename) as fp:
        stats = [0] * (512)
        line = fp.readline()
        while line:
            raw_line = line.strip()
            arr_of_parsed_numbers = re.split(r'\t+', raw_line.rstrip('\t'))
            for num_index in range(0, len(arr_of_parsed_numbers)):
                my_number = int(arr_of_parsed_numbers[num_index])
                v0 = (my_number % 257) - 1  # value 257 is correct
                my_number = (my_number) // 257
                stats[v0] += 1
                v1 = my_number % 256
                stats[256 + v1] += 1
                # both output files are reopened and closed for every number
                f0 = open(fstr0, "a")
                f1 = open(fstr1, "a")
                f0.write("{}\n".format(str(v0).rjust(3)))
                f1.write("{}\n".format(str(v1).rjust(3)))
                f0.close()
                f1.close()
            line = fp.readline()
    print(stats)
# tried output_file.close() here as well
print("done")
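Update: I have run the script on Windows 10 (10 Mb of memory in Python.exe) and on Ubuntu (10 Gb of memory consumed). What could cause this difference? A factor of a thousand is a lot.
This script consumes about 20 Mb on Windows 10 (it looks like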
Answer 0: (score: 0)
Try something like this. Note that each file is opened and closed only once, and that the loop iterates once per line.
import os
import re

my_path = '/media/me/mSata/res/'
# output_file.open() before the first loop didn't help
for file_id in range(10, 11):  # 10,201
    filename = my_path + "in" + str(file_id) + ".txt"
    fstr0 = "" + my_path + "out" + str(file_id) + "_0.log"
    fstr1 = "" + my_path + "res" + str(file_id) + "_1.log"
    with open(filename, "r") as fp, open(fstr0, "a") as f0, open(fstr1, "a") as f1:
        stats = [0] * (512)
        for line in fp:
            raw_line = line.strip()
            arr_of_parsed_numbers = re.split(r'\t+', raw_line.rstrip('\t'))
            for num_index in range(0, len(arr_of_parsed_numbers)):
                my_number = int(arr_of_parsed_numbers[num_index])
                v0 = (my_number % 257) - 1  # value 257 is correct
                my_number = (my_number) // 257
                stats[v0] += 1
                v1 = my_number % 256
                stats[256 + v1] += 1
                f0.write("{}\n".format(str(v0).rjust(3)))
                f1.write("{}\n".format(str(v1).rjust(3)))
    print(stats)
# tried output_file.close() here as well
print("done")
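
If that is still too slow, one possible variation is to pull the arithmetic into a small function and write each input line's results in a single call. This is only a sketch and has not been run against your data. The names split_number and process are made up for illustration, and it assumes the fields are separated by whitespace (tabs included), so it uses str.split() instead of the regular expression.

```python
import io


def split_number(n):
    """Return (v0, v1) using the same arithmetic as the script above."""
    # Note: v0 is -1 when n is a multiple of 257, so stats[v0] then
    # addresses the last slot of the list, exactly as in the original.
    v0 = (n % 257) - 1
    v1 = (n // 257) % 256
    return v0, v1


def process(lines, out0, out1):
    """Consume an iterable of text lines, write results, return the stats list.

    `lines`, `out0` and `out1` can be real file objects or io.StringIO.
    """
    stats = [0] * 512
    for line in lines:
        buf0, buf1 = [], []
        # split() with no argument skips empty fields and blank lines
        for token in line.split():
            v0, v1 = split_number(int(token))
            stats[v0] += 1
            stats[256 + v1] += 1
            buf0.append("{}\n".format(str(v0).rjust(3)))
            buf1.append("{}\n".format(str(v1).rjust(3)))
        # one write call per input line instead of one per number
        out0.writelines(buf0)
        out1.writelines(buf1)
    return stats


if __name__ == "__main__":
    # tiny in-memory demo; replace the StringIO objects with open(...) files
    src = io.StringIO("258\t515\n\n1\n")
    dst0, dst1 = io.StringIO(), io.StringIO()
    print(sum(process(src, dst0, dst1)), "numbers processed")
```

With real files you would call it inside the same with statement as above, for example process(fp, f0, f1). Because it only needs objects that can be iterated and written to, it is easy to check on a few lines in memory before running it on the full data.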