I have the following code, and it runs very slowly. It is a program that splits a big file (80 gigabytes) into a tree-like folder structure for fast lookup. I added some comments in the code to help you understand it.
# Libraries
import os

# Variables
file = "80_gig_file.txt"
outputdirectory = "sorted"
depth = 4  # This is the tree depth

# Preparations
os.makedirs(outputdirectory)

# Process each line in the file
def pipeline(line):
    # Strip symbols from line
    line_stripped = ''.join(e for e in line if e.isalnum())
    # Reverse the line
    line_stripped_reversed = line_stripped[::-1]

    file = outputdirectory
    # Create path location in folder-based tree
    for i in range(min(depth, len(line_stripped))):
        file = os.path.join(file, line_stripped_reversed[i])

    # Create folders if they don't exist
    os.makedirs(os.path.dirname(file), exist_ok=True)

    # Name the file, with "-file"
    file = file + "-file"

    # This is the operation that slows everything down.
    # It opens, writes and closes a lot of small files.
    # I cannot keep them open because in the worst case half a million
    # possibilities (and thus files) are open at once (n = 26^4).
    f = open(file, "a")
    f.write(line)
    f.close()

# Read the file line by line, without loading it entirely into memory.
# It might be possible to work with a queue here, but how do I do that properly without loading too much into memory?
with open(file) as infile:
    for line in infile:
        pipeline(line)
Is there a way to make multithreading work here? I tried a few examples I found online myself, but they kept everything in memory and froze my computer several times.
Answer (score: 1)
If, as it appears, the lines are completely independent, just split your file into N chunks, pass the file name to open as a program argument, and run multiple instances of your current script, starting them manually from several command lines.
This would be implemented as:
APPROX_CHUNK_SIZE = 1e9  # 1GB per file, adjust as needed

with open('big_file.txt') as fp:
    chunk_id = 0
    next_chunk = fp.readlines(APPROX_CHUNK_SIZE)
    while next_chunk:
        with open('big_file_{}.txt'.format(chunk_id), 'w') as ofp:
            ofp.writelines(next_chunk)
        chunk_id += 1
        next_chunk = fp.readlines(APPROX_CHUNK_SIZE)
As the readlines documentation explains: if the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read.
Doing it this way does not guarantee an equal number of lines in every chunk, but since you read chunk by chunk rather than line by line, the preprocessing is much faster. Adjust the chunk size as needed.
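As a quick illustration of the size hint (the file name here is just an example, not from the original answer): readlines keeps whole lines even if that slightly overshoots the hint.
# Minimal sketch, assuming a small text file 'sample.txt' exists:
with open('sample.txt') as fp:
    batch = fp.readlines(64)  # read whole lines totalling roughly 64 characters
    print(len(batch), 'lines read')
    print(sum(len(l) for l in batch), 'characters; may slightly exceed the hint')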
Also, note that by using readlines we make sure no line is broken across two chunks; since the function returns a list of lines, we use writelines to write them to the output file (which is equivalent to looping over the list and calling ofp.write(line) for each one). For completeness, note that you could also concatenate all the strings in memory and call write just once (i.e., do ofp.write(''.join(next_chunk))), which might give you a (minor) performance benefit at the cost of (much) higher RAM usage.
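A minimal sketch of that equivalence (example data and output file names are hypothetical, not part of the original answer):
# Two equivalent ways to write a chunk of lines; the files end up byte-identical.
next_chunk = ['first line\n', 'second line\n']  # example data

with open('chunk_a.txt', 'w') as ofp:
    ofp.writelines(next_chunk)          # many small writes under the hood

with open('chunk_b.txt', 'w') as ofp:
    ofp.write(''.join(next_chunk))      # one big write, extra RAM for the joined string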
The only modification you need is at the very top of your script:
import sys
file=sys.argv[1]
... # rest of your script here
By using argv you can pass command-line arguments to the program (in this case, the file to open). Then simply run the script as:
python process_the_file.py big_file_0.txt
This runs one process. Open several terminals and run the same command with big_file_N.txt in each one; the instances are completely independent of each other.
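If you prefer not to open the terminals by hand, here is a hypothetical launcher (not part of the original answer; it assumes the chunk files big_file_0.txt through big_file_79.txt already exist) that starts the instances from Python:
import subprocess

# One process per chunk; in practice you may want to cap this at your CPU count.
procs = [
    subprocess.Popen(['python', 'process_the_file.py', 'big_file_{}.txt'.format(i)])
    for i in range(80)
]
for p in procs:
    p.wait()  # block until every instance has finished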
Note: I use argv[1] because for every program the first value of argv (i.e., argv[0]) is always the program name.
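A quick illustration (the script name is hypothetical):
# Contents of show_argv.py
import sys
print(sys.argv)
# Running "python show_argv.py big_file_0.txt" prints:
# ['show_argv.py', 'big_file_0.txt']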
The multiprocessing solution
The first solution works, but it is not very elegant, especially since starting from an 80GB file you end up with 80 chunk files.
A cleaner solution is to use Python's multiprocessing module (important: NOT threading! If you don't know the difference, look up "global interpreter lock" and why multithreading in Python does not work the way you think it does).
The idea is to have one "producer" process that opens the big file and continuously puts lines onto a queue, and a pool of "consumer" processes that pull lines off the queue and process them (a sketch of this pattern with an explicit queue is given at the end of this answer).
This would be implemented as follows:
# Libraries
import os
import multiprocessing

outputdirectory = "sorted"
depth = 4  # This is the tree depth

# Process each line in the file
def pipeline(line):
    # Strip symbols from line
    line_stripped = ''.join(e for e in line if e.isalnum())
    # Reverse the line
    line_stripped_reversed = line_stripped[::-1]

    file = outputdirectory
    # Create path location in folder-based tree
    for i in range(min(depth, len(line_stripped))):
        file = os.path.join(file, line_stripped_reversed[i])

    # Create folders if they don't exist
    os.makedirs(os.path.dirname(file), exist_ok=True)

    # Name the file, with "-file"
    file = file + "-file"

    # This is the operation that slows everything down.
    # It opens, writes and closes a lot of small files.
    # They cannot all be kept open because in the worst case half a million
    # possibilities (and thus files) are open at once (n = 26^4).
    f = open(file, "a")
    f.write(line)
    f.close()

if __name__ == '__main__':
    # Variables
    file = "80_gig_file.txt"

    # Preparations
    os.makedirs(outputdirectory)

    pool = multiprocessing.Pool()  # by default, 1 process per CPU
    LINES_PER_PROCESS = 1000  # adapt as needed. Higher is better, but consumes more RAM

    with open(file) as infile:
        # imap hands the lines to the workers in chunks of LINES_PER_PROCESS;
        # consuming the iterator makes sure every line is actually processed.
        for _ in pool.imap(pipeline, infile, LINES_PER_PROCESS):
            pass
        pool.close()
        pool.join()
The if __name__ == '__main__' line is a barrier separating the code that runs in every process from the code that runs only in the parent. Every process defines pipeline, but only the parent actually spawns the pool of workers and applies the function. You can find more details about multiprocessing.map here.
Closing and joining the pool was added to prevent the main process from exiting and killing the child processes before they finish.
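For reference, here is a minimal sketch of the explicit producer/consumer pattern described above, using a multiprocessing.Queue directly instead of Pool.imap. The worker count, queue size, and sentinel value are assumptions, not part of the original answer, and the sketch reuses the pipeline() function defined in the script above.
import multiprocessing

SENTINEL = None   # marker telling a worker there is no more work (assumption)
N_WORKERS = 4     # assumption: tune to your CPU count

def consumer(queue):
    # Pull lines off the queue until the sentinel arrives, then stop.
    while True:
        line = queue.get()
        if line is SENTINEL:
            break
        pipeline(line)  # the pipeline() function defined above

if __name__ == '__main__':
    queue = multiprocessing.Queue(maxsize=10000)  # bounded, so RAM use stays limited
    workers = [multiprocessing.Process(target=consumer, args=(queue,))
               for _ in range(N_WORKERS)]
    for w in workers:
        w.start()

    # Producer: read the big file line by line and feed the queue.
    with open("80_gig_file.txt") as infile:
        for line in infile:
            queue.put(line)  # blocks when the queue is full

    for _ in workers:
        queue.put(SENTINEL)  # one sentinel per worker
    for w in workers:
        w.join()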