Question

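I have the following code, which runs very slowly. It is a program that splits a large file (80 GB) and places the pieces into a tree-like folder structure for fast lookups. I added some comments in the code to help you understand it.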

# Libraries
import os


# Variables
file="80_gig_file.txt"
outputdirectory="sorted"
depth=4 # This is the tree depth


# Preparations
os.makedirs(outputdirectory)

# Process each line in the file
def pipeline(line):
    # Strip symbols from line
    line_stripped=''.join(e for e in line if e.isalnum())
    # Reverse the line
    line_stripped_reversed=line_stripped[::-1]
    file=outputdirectory
    # Create path location in folderbased tree
    for i in range(min((depth),len(line_stripped))):
        file=os.path.join(file,line_stripped_reversed[i])
    # Create folders if they don't exist
    os.makedirs(os.path.dirname(file), exist_ok=True)
    # Name the file, with "-file"
    file=file+"-file"
    # This is the operation that slows everything down. 
    # It opens, writes and closes a lot of small files. 
    # I cannot keep them open because currently half a million possibilities (and thus files) are worst case open (n=26^4).
    f = open(file, "a")
    f.write(line)
    f.close()


# Read file line by line and by not loading it entirely in memory
# Here it is possible to work with a queue I think, but how to do it properly without loading too much in memory?
with open(file) as infile:
    for line in infile:
        pipeline(line)

Is there a way to make multithreading work here? The examples I found online and tried all put everything in memory, which froze my computer several times.

Answer 1

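First, the simplest (IMO) solution

If the lines are completely independent, as they appear to be, just split your file into N chunks, pass the name of the file to open as a program argument, and run multiple instances of your current script, starting them manually from multiple command lines.

Pros:

No need to delve into multiprocessing, inter-process communication, etc.
Does not require much change to the code

Cons:

You need to preprocess the big file to split it into chunks (although this will be much faster than your current execution time, since there is no open/close per line)
You need to start the processes yourself, passing the appropriate filename to each of them

This would be implemented as:

Preprocessing: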

APPROX_CHUNK_SIZE = 10**9 # about 1GB per file, adjust as needed (readlines() needs an int, not the float 1e9)
with open('big_file.txt') as fp:
  chunk_id = 0
  next_chunk = fp.readlines(APPROX_CHUNK_SIZE)
  while next_chunk:
    with open('big_file_{}.txt'.format(chunk_id), 'w') as ofp:
      ofp.writelines(next_chunk)
    chunk_id += 1
    next_chunk = fp.readlines(APPROX_CHUNK_SIZE)

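From the readlines docs:

If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read.

Doing it this way does not ensure an equal number of lines in every chunk, but it makes the preprocessing much faster because you read in blocks rather than line by line. Adapt the chunk size as needed. Also note that by using readlines we can be sure no line is broken between chunks, but since the function returns a list of lines, we use writelines to write it to the output file (which is equivalent to looping over the list and calling ofp.write(line)). For completeness, you could also concatenate all the strings in memory and call write just once (i.e., do ofp.write(''.join(next_chunk))), which might bring a (minor) performance benefit at the cost of higher RAM usage.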
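To make the behaviour of readlines with a size hint concrete, here is a minimal sketch that runs on a small in-memory file instead of the real 80 GB input. The helper name split_into_chunks and the tiny hint value of 40 are made up for illustration; the only point is that every chunk consists of whole lines.

```python
import io


def split_into_chunks(fp, approx_chunk_size):
    """Yield lists of whole lines, each list totalling roughly approx_chunk_size."""
    next_chunk = fp.readlines(approx_chunk_size)
    while next_chunk:
        yield next_chunk
        next_chunk = fp.readlines(approx_chunk_size)


# Stand-in for the big file: 10 short lines held in memory.
text = "".join("line number {}\n".format(i) for i in range(10))
chunks = list(split_into_chunks(io.StringIO(text), 40))

# No line is ever cut in half, and nothing is lost or reordered.
assert all(line.endswith("\n") for chunk in chunks for line in chunk)
assert "".join("".join(chunk) for chunk in chunks) == text
print(len(chunks), "chunks")
```

With a real file the hint would be something like the 1 GB used above, and each yielded list would be written to its own chunk file.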

Main script:

The only modification you need is at the very top:

import sys
file=sys.argv[1]
... # rest of your script here

By using argv, you can pass command-line arguments to your program (in this case, the file to open). Then just run your script as:

python process_the_file.py big_file_0.txt

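This runs one process. Open multiple terminals, run the same command in each with a different big_file_N.txt, and the processes will be independent of each other. Because every instance shares the same output directory, also change os.makedirs(outputdirectory) to os.makedirs(outputdirectory, exist_ok=True); otherwise every instance after the first fails with FileExistsError.

Note: I use argv[1] because, for all programs, the first value of argv (i.e. argv[0]) is always the program name.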
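If opening one terminal per chunk gets tedious, the same commands can be started from a small launcher script. This is only a sketch: it assumes the chunk files follow the big_file_N.txt naming from the preprocessing step and that the main script is saved as process_the_file.py. The helper name build_commands is made up here.

```python
import glob
import subprocess
import sys


def build_commands(chunk_paths, script="process_the_file.py"):
    """Return one command line per chunk, the same as typing it in a terminal."""
    return [[sys.executable, script, path] for path in sorted(chunk_paths)]


if __name__ == "__main__":
    commands = build_commands(glob.glob("big_file_*.txt"))
    # Start every instance without waiting, then wait for all of them to finish.
    processes = [subprocess.Popen(cmd) for cmd in commands]
    for proc in processes:
        proc.wait()
```

This starts one process per chunk file at once, so with many chunks you may prefer to launch them in smaller groups.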

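Then, the `multiprocessing` solution

The first solution works, but it is not very elegant, especially since you would end up with 80 files when starting from an 80GB file.

A cleaner solution is to make use of python's multiprocessing module (important: NOT threading! If you don't know the difference, look up "global interpreter lock" and why multithreading in python does not work the way you might think it does).

The idea is to have one "producer" process that opens the big file and keeps putting lines from it into a queue. Then a pool of "consumer" processes extracts lines from the queue and does the processing.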
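The producer/consumer idea can be spelled out with an explicit queue. The sketch below is one possible wiring, not the code used later in this answer: the names produce, consume, handle_line and the default sizes are assumptions made for illustration, and handle_line is a placeholder for the real per-line work (pipeline in the question). The queue is bounded, so the producer pauses when consumers fall behind and the file is never held in memory as a whole.

```python
import multiprocessing
import sys

SENTINEL = None  # tells one consumer that the input is finished


def produce(lines, queue, n_consumers):
    """Put every line on the queue, then one sentinel per consumer."""
    for line in lines:
        queue.put(line)  # blocks while the queue is full, so RAM stays bounded
    for _ in range(n_consumers):
        queue.put(SENTINEL)


def consume(queue, handle_line):
    """Take lines off the queue until a sentinel arrives; return the count."""
    handled = 0
    while True:
        line = queue.get()
        if line is SENTINEL:
            return handled
        handle_line(line)
        handled += 1


def handle_line(line):
    """Placeholder for the real per-line work (pipeline in the question)."""
    pass


def main(path, n_consumers=4, max_queued_lines=10000):
    queue = multiprocessing.Queue(maxsize=max_queued_lines)
    consumers = [
        multiprocessing.Process(target=consume, args=(queue, handle_line))
        for _ in range(n_consumers)
    ]
    for proc in consumers:
        proc.start()
    # The parent process acts as the producer and reads the file lazily.
    with open(path) as infile:
        produce(infile, queue, n_consumers)
    for proc in consumers:
        proc.join()


# Only runs when started as a script with the input file as argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Sending one queue item per line is simple but adds inter-process overhead for every line; grouping several lines per item reduces that cost.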

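Pros:

One script does everything
No need to open multiple terminals and type commands

Cons:

Complexity
Uses inter-process communication, which has some overhead

This would be implemented as follows: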

# Libraries
import os
import multiprocessing

outputdirectory="sorted"
depth=4 # This is the tree depth

# Process each line in the file
def pipeline(line):
    # Strip symbols from line
    line_stripped=''.join(e for e in line if e.isalnum())
    # Reverse the line
    line_stripped_reversed=line_stripped[::-1]
    file=outputdirectory
    # Create path location in folderbased tree
    for i in range(min((depth),len(line_stripped))):
        file=os.path.join(file,line_stripped_reversed[i])
    # Create folders if they don't exist
    os.makedirs(os.path.dirname(file), exist_ok=True)
    # Name the file, with "-file"
    file=file+"-file"
    # This is the operation that slows everything down. 
    # It opens, writes and closes a lot of small files. 
    # I cannot keep them open because currently half a million possibilities (and thus files) are worst case open (n=26^4).
    f = open(file, "a")
    f.write(line)
    f.close()

if __name__ == '__main__':
    # Variables
    file="80_gig_file.txt"

    # Preparations
    os.makedirs(outputdirectory)
    pool = multiprocessing.Pool() # by default, 1 process per CPU
    LINES_PER_PROCESS = 1000 # adapt as needed. Higher is better, but consumes more RAM

    with open(file) as infile:
        # Iterate over all results so that an exception raised in any worker surfaces here
        for _ in pool.imap(pipeline, infile, LINES_PER_PROCESS):
            pass
        pool.close()
        pool.join()

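The if __name__ == '__main__' line is a barrier that separates code that runs in every process from code that runs only in the "parent". Every process defines pipeline, but only the parent actually spawns the pool of workers and applies the function. You can find more details about multiprocessing.Pool.map in the documentation of the multiprocessing module.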
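The third argument passed to pool.imap (LINES_PER_PROCESS above) is the chunksize: lines are handed to the workers in groups rather than one at a time, which cuts down on inter-process traffic. The sketch below only illustrates that grouping. The helper batches is made up for this example and is not part of multiprocessing, whose internals may differ in detail.

```python
import itertools


def batches(iterable, size):
    """Yield tuples of up to `size` consecutive items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = tuple(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch


lines = ["line {}\n".format(i) for i in range(7)]
for batch in batches(lines, 3):
    print(len(batch), "lines would go to one worker as a single task")
```

Because the input is consumed lazily, only the groups currently in flight need to be in memory, which is why a larger value costs more RAM.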

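EDIT:

Added closing and joining the pool, to prevent the main process from exiting and killing the child processes while they are still working.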

Using multithreading when reading and processing a huge file (too large for memory)
