Question

I am parsing data from 100,000 files and saving it to another file for further processing. I used Python's multiprocessing module to speed this up.

import multiprocessing
from pathlib import Path

processes = []
for num in range(1, 5000):
    string = "{0:06}".format(num)
    path = "filename" + string + ".npy"  # use the zero-padded string, not the int
    check_file_exist = Path(path)
    if check_file_exist.is_file():
        ## Multiprocessing for generating file using multiple cpus
        p = multiprocessing.Process(target=Get_feature_vector, args=(path,))
        processes.append(p)
        p.start()
    else:
        print("file not found", string)

# Wait for every started process to finish
for process in processes:
    process.join()

The code above raises the error [Errno 24] Too many open files. To avoid it, how can I use multiprocessing while keeping only 20-30 files open at a time?

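I have read the documentation on pool.map(), but building a list of 100K file names is more than I would like to do. Is there an efficient way to speed up opening a large number of files? I have a machine with 40 processors.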

Answer 1

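If you do not want to generate the whole list, use a handy generator that yields a specified number of files at a time, and feed each batch to pool.map():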

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

import pprint
# list(...) around range() so that each chunk prints as a list in Python 3
pprint.pprint(list(chunks(list(range(10, 75)), 10)))
[[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
 [70, 71, 72, 73, 74]]
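
Note that chunks() above calls len() and slices its argument, so it needs a sequence. If the file names come from a generator instead, a variant along the following lines may help. This is only a sketch: the name lazy_chunks is made up for illustration and is not part of the original answer.

```python
from itertools import islice


def lazy_chunks(iterable, n):
    """Yield lists of up to n items from any iterable, including generators."""
    iterator = iter(iterable)
    while True:
        # islice pulls at most n items without touching the rest
        chunk = list(islice(iterator, n))
        if not chunk:
            return
        yield chunk
```

It should produce the same batches as chunks() for a range, while also accepting input that has no length.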

Alternatively, you can use:

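from multiprocessing import Pool
from pathlib import Path

pool = Pool(10)  # create the pool once and reuse it for every chunk
for chunk in chunks(range(1, 5000), 10):  # chunk size is the same as pool size = 10
    file_names = []
    for num in chunk:
        string = "{0:06}".format(num)
        path = "filename" + string + ".npy"
        if Path(path).is_file():  # skip files that do not exist
            file_names.append(path)
    pool.map(Get_feature_vector, file_names)  # blocks until this chunk is done
pool.close()
pool.join()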
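
Another option, sketched here under a few assumptions, is to skip the manual chunking and pass a generator of paths to Pool.imap_unordered(), which accepts any iterable, so the names never have to be collected into a list. The pool size caps how many worker processes are alive at once, which should keep the number of open descriptors bounded. The names existing_paths, process_all and get_feature_vector are hypothetical; get_feature_vector is only a stand-in for the asker's Get_feature_vector, and the pool size of 20 and chunksize of 16 are arbitrary choices.

```python
import multiprocessing
from pathlib import Path


def existing_paths(start, stop, directory="."):
    """Lazily yield the paths of the .npy files that exist, one at a time."""
    for num in range(start, stop):
        path = Path(directory) / "filename{0:06}.npy".format(num)
        if path.is_file():
            yield str(path)
        else:
            print("file not found", path.name)


def get_feature_vector(path):
    # Stand-in for the asker's Get_feature_vector (assumption):
    # it only returns the file size so that the sketch runs.
    return Path(path).stat().st_size


def process_all(paths, worker, processes=20):
    """Apply worker to every path using at most `processes` worker processes."""
    with multiprocessing.Pool(processes=processes) as pool:
        # Results arrive in completion order; they are discarded here.
        for _ in pool.imap_unordered(worker, paths, chunksize=16):
            pass


if __name__ == "__main__":
    process_all(existing_paths(1, 5000), get_feature_vector)
```

The `if __name__ == "__main__":` guard matters on platforms that start workers by spawning a fresh interpreter, since the module is re-imported in each worker.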
