Question

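I have a dictionary of folder names that I want to process in parallel. Under each folder there is an array of file names that I want to process in series: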

folder_file_dict = {
         folder_name : {
                         file_names_key : [file_names_array]
                       }
        }

Ultimately, I will create a folder named folder_name containing len(folder_file_dict[folder_name][file_names_key]) files, one per file name. I have a method like this:

def process_files_in_series(file_names_array, udp_port):
    for file_name in file_names_array:
         time_consuming_method(file_name, udp_port)
         # create "file_name"

udp_ports = [123, 456, 789]

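Note the time_consuming_method() above: it takes a long time because it makes calls over a UDP port. I am also limited to the UDP ports in the array above, so I have to wait for time_consuming_method to finish with a UDP port before I can use that port again. This means I can only have len(udp_ports) threads running at a time.

Thus, I will ultimately create len(folder_file_dict.keys()) threads, with len(folder_file_dict.keys()) calls to process_files_in_series. I also have a MAX_THREAD count. I am trying to use the Queue and Threading modules, but I am not sure what kind of design I need. How can I do this using Queues and Threads, and possibly Conditions as well? A solution that uses a thread pool may also be helpful.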

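NOTE

I am not trying to increase the read/write speed. I am trying to parallelize the calls to time_consuming_method made inside process_files_in_series. Creating the files is just part of the process, not the rate-limiting step.

Also, I am looking for a solution that uses the Queue, Threading, and possibly Condition modules, or anything related to those modules. A thread pool solution may also be helpful. I cannot use processes, only threads.

I am also looking for a solution in Python 2.7.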

Answer 1

Use a thread pool:

#!/usr/bin/env python2
from multiprocessing.dummy import Pool, Queue # thread pool

folder_file_dict = {
    folder_name: {
        file_names_key: file_names_array
    }
}

def process_files_in_series(file_names_array, udp_port):
    for file_name in file_names_array:
         time_consuming_method(file_name, udp_port)
         # create "file_name"
         ...

def mp_process(filenames):
    udp_port = free_udp_ports.get() # block until a free udp port is available
    args = filenames, udp_port
    try:
        return args, process_files_in_series(*args), None
    except Exception as e:
        return args, None, str(e)
    finally:
        free_udp_ports.put_nowait(udp_port)

free_udp_ports = Queue() # in general, use initializer to pass it to children
for port in udp_ports:
    free_udp_ports.put_nowait(port)
pool = Pool(number_of_concurrent_jobs) # number of worker threads
for args, result, error in pool.imap_unordered(mp_process, get_files_arrays()):
    if error is not None:
       print args, error

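I don't think you need to tie the number of threads to the number of UDP ports if the processing time may differ between file name arrays.

If I understand the structure of folder_file_dict correctly, then to generate the file name arrays: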

def get_files_arrays(folder_file_dict=folder_file_dict):
    for folder_name_dict in folder_file_dict.itervalues():
        for filenames_array in folder_name_dict.itervalues():
            yield filenames_array
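
For anyone who wants to try the pattern above without real UDP calls, here is a minimal self-contained sketch. It makes several assumptions: time_consuming_method is replaced by a stub that only sleeps, the sample folder_file_dict is made up, and the ports_in_use / overlaps bookkeeping exists only to check that no port is ever held by two threads at once. The pool deliberately has more threads than ports; the extra threads simply block on free_udp_ports.get().

```python
import threading
import time
from multiprocessing.dummy import Pool, Queue  # thread-based Pool

udp_ports = [123, 456, 789]

# Hypothetical sample data standing in for the real folder_file_dict.
folder_file_dict = {
    "folder_%d" % i: {"files": ["f%d_%d" % (i, j) for j in range(3)]}
    for i in range(6)
}

ports_in_use = set()           # ports currently held by some thread
in_use_lock = threading.Lock()
overlaps = []                  # filled only if a port is used twice at once

def time_consuming_method(file_name, udp_port):
    # Stub: the real method talks over the UDP port; here we only sleep.
    with in_use_lock:
        if udp_port in ports_in_use:
            overlaps.append(udp_port)
        ports_in_use.add(udp_port)
    time.sleep(0.01)
    with in_use_lock:
        ports_in_use.discard(udp_port)

free_udp_ports = Queue()
for port in udp_ports:
    free_udp_ports.put_nowait(port)

def mp_process(filenames):
    udp_port = free_udp_ports.get()  # block until a port is free
    try:
        for file_name in filenames:
            time_consuming_method(file_name, udp_port)
        return filenames, udp_port, None
    except Exception as e:
        return filenames, udp_port, str(e)
    finally:
        free_udp_ports.put_nowait(udp_port)  # hand the port back

def get_files_arrays(d):
    for folder_name_dict in d.values():
        for filenames_array in folder_name_dict.values():
            yield filenames_array

pool = Pool(5)  # deliberately more threads than ports
results = list(pool.imap_unordered(mp_process, get_files_arrays(folder_file_dict)))
pool.close()
pool.join()
```

After the run, overlaps should still be empty and all three ports should be back in free_udp_ports, since the queue acts as the limiter regardless of the pool size.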

Answer 2

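Use multiprocessing.pool.ThreadPool. It handles the queue/thread management for you and can easily be switched to multiprocessing.

EDIT: Added example

Here's an example... multiple threads may end up using the same UDP port. I'm not sure whether that is a problem for you.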

import itertools
import multiprocessing.pool

def process_files_in_series(file_names_array, udp_port):
    for file_name in file_names_array:
         time_consuming_method(file_name, udp_port)
         # create "file_name"

def process_job(job):
    # pool.map passes a single argument, so unpack the (files, port) pair here
    file_names_array, udp_port = job
    return process_files_in_series(file_names_array, udp_port)

udp_ports = [123, 456, 789]

folder_file_dict = {
         folder_name : {
                         file_names_key : [file_names_array]
                       }
        }

def main(folder_file_dict, udp_ports):
    # number of threads - here I'm limiting to the smaller of udp_ports,
    # file lists to process and a cap I arbitrarily set to 4
    num_threads = min(len(folder_file_dict), len(udp_ports), 4)
    # the pool
    pool = multiprocessing.pool.ThreadPool(num_threads)
    # build files to be processed into list. You may want to do other
    # things like join folder_name...
    file_arrays = [value[file_names_key] for value in folder_file_dict.values()]
    # do the work: pair each file list with a port, cycling through the ports
    pool.map(process_job, zip(file_arrays, itertools.cycle(udp_ports)))
    pool.close()
    pool.join()
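
If sharing a port between threads does turn out to be a problem, one possible variation on this ThreadPool approach is to split the file arrays up front so that each port is handled by exactly one task. The sketch below is only an illustration under assumptions: time_consuming_method is a stub that records its calls, the sample file_arrays are made up, and process_port and calls are hypothetical names that are not part of the answer above.

```python
from multiprocessing.pool import ThreadPool

udp_ports = [123, 456, 789]

# Hypothetical sample input: seven file name arrays.
file_arrays = [["a%d_%d" % (i, j) for j in range(2)] for i in range(7)]

calls = []  # (udp_port, file_name) pairs; list.append is thread-safe in CPython

def time_consuming_method(file_name, udp_port):
    # Stub for the real UDP call.
    calls.append((udp_port, file_name))

def process_files_in_series(file_names_array, udp_port):
    for file_name in file_names_array:
        time_consuming_method(file_name, udp_port)

def process_port(job):
    # One task per port: everything assigned to this port runs in series.
    udp_port, arrays = job
    for file_names_array in arrays:
        process_files_in_series(file_names_array, udp_port)
    return udp_port, len(arrays)

n = len(udp_ports)
# Port number k gets arrays k, k+n, k+2n, ...
jobs = [(port, file_arrays[k::n]) for k, port in enumerate(udp_ports)]

pool = ThreadPool(n)
summary = dict(pool.map(process_port, jobs))
pool.close()
pool.join()
```

The trade-off is that the split is static: if one port's share takes much longer than the others, the remaining ports sit idle, whereas a queue of free ports hands work out as ports become available.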

Answer 3

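This is a kind of blueprint for using multiprocessing.Process with a JoinableQueue to hand jobs to workers. You will still be I/O bound, but with Process you get true concurrency, which may prove useful, since threading may be even slower than a plain script processing the files.

(Be aware that this will also keep you from doing anything else on your laptop if you dare to start too many processes at once :P)

I tried to explain the code with as many comments as possible.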

import traceback

from multiprocessing import Process, JoinableQueue, cpu_count

# Number of CPUs on your PC
cpus = cpu_count()

# The Worker Function. Could also be modelled as a class
def Worker(q_jobs):
    while True:
        # try / except / finally may be necessary for error-prone tasks since the processes
        # may hang forever if the task_done() method is not called.
        try:
            # Get an item from the Queue
            item = q_jobs.get()

            # At this point the data should somehow be processed

        except:
            traceback.print_exc()
        else:
            pass

        finally:
            # Inform the Queue that the Task has been done.
            # Without this, q_jobs.join() below never returns
            # and the script hangs instead of exiting
            q_jobs.task_done()


# A Joinable Queue to end the process
q_jobs = JoinableQueue()

# Create process depending on the number of CPU's
for i in range(cpus):

    # target function and arguments
    # a single-element args tuple needs a trailing ',' e.g. (q_jobs,);
    # with multiple arguments it is not needed, e.g. (q_jobs, 'bla')
    p = Process(target=Worker,
                args=(q_jobs,)
                )
    p.daemon = True
    p.start()

# fill Queue with Jobs
q_jobs.put(['Do'])
q_jobs.put(['Something'])

# End Process
q_jobs.join()

Cheers

EDIT

I wrote this in Python 3. Taking the parentheses off the print function:

print item

should make this work in 2.7.
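
Since the question rules out processes, the same worker/queue blueprint can be sketched with threads instead. The following is a minimal illustration, not a drop-in solution, and it rests on assumptions: time_consuming_method is a stub that records its calls, the jobs are made-up file name arrays, and each worker thread is given one UDP port for its whole lifetime so that no port is ever shared. A None sentinel per worker is used for shutdown instead of daemon processes.

```python
import threading

try:
    import Queue as queue  # Python 2.7
except ImportError:
    import queue  # Python 3

udp_ports = [123, 456, 789]
processed = []  # (udp_port, file_name); list.append is thread-safe in CPython

def time_consuming_method(file_name, udp_port):
    # Stub for the real UDP call.
    processed.append((udp_port, file_name))

def worker(q_jobs, udp_port):
    # Each worker thread owns exactly one port for its whole lifetime.
    while True:
        file_names_array = q_jobs.get()
        try:
            if file_names_array is None:  # sentinel: no more work
                return
            for file_name in file_names_array:
                time_consuming_method(file_name, udp_port)
        finally:
            # Always balance get() with task_done() so q_jobs.join() can return.
            q_jobs.task_done()

q_jobs = queue.Queue()
threads = [threading.Thread(target=worker, args=(q_jobs, port))
           for port in udp_ports]
for t in threads:
    t.start()

# Hypothetical jobs: five file name arrays.
for i in range(5):
    q_jobs.put(["f%d_%d" % (i, j) for j in range(2)])
for _ in threads:
    q_jobs.put(None)  # one sentinel per worker

q_jobs.join()   # wait until every job and sentinel has been handled
for t in threads:
    t.join()
```

Because the number of worker threads equals len(udp_ports), the thread limit from the question is respected automatically; a MAX_THREAD cap could be applied by starting workers for only the first MAX_THREAD ports.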

Python multithreaded processing with limited CPU/ports
