Python multithreading with a limited number of CPUs / ports

Date: 2014-10-29 16:58:27

Tags: python multithreading python-2.7 parallel-processing

I have a dictionary of folder names that I want to process in parallel. Under each folder there is an array of file names that I want to process in series:

folder_file_dict = {
         folder_name : {
                         file_names_key : [file_names_array]
                       }
        }

Ultimately, I will be creating a folder named folder_name containing len(folder_file_dict[folder_name][file_names_key]) files. I have a method like this:

def process_files_in_series(file_names_array, udp_port):
    for file_name in file_names_array:
         time_consuming_method(file_name, udp_port)
         # create "file_name"

udp_ports = [123, 456, 789]

Note that time_consuming_method() above takes a long time because it makes calls over a UDP port. I am also limited to the UDP ports in the array above, so I must wait for time_consuming_method to finish on a UDP port before I can use that UDP port again. This means I can only have len(udp_ports) threads running at a time.

So I will ultimately be creating len(folder_file_dict.keys()) threads, with len(folder_file_dict.keys()) calls to process_files_in_series. I also have a MAX_THREAD count. I am trying to use the Queue and threading modules, but I am not sure what kind of design I need. How can I do this using queues and threads, and possibly conditions as well? A solution that uses a thread pool may also be helpful.

NOTE

I am not trying to increase the read/write speed. I am trying to parallelize the calls to time_consuming_method made inside process_files_in_series. Creating those files is just part of the process, not the rate-limiting step.

Also, I am looking for a solution that uses the Queue and threading modules, possibly Condition, or anything related to those modules. A thread pool solution may also help. I cannot use processes, only threads.

I am also looking for a solution for Python 2.7.
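
For reference, a minimal threads-only sketch of the port-pool idea described above (Python 2.7; worker is just a placeholder name, and the answers below give fuller solutions):

import threading
import Queue  # the Python 2.7 queue module

port_queue = Queue.Queue()
for port in udp_ports:
    port_queue.put(port)

def worker(file_names_array):
    # block until a port is free, so at most len(udp_ports) calls run at once
    udp_port = port_queue.get()
    try:
        process_files_in_series(file_names_array, udp_port)
    finally:
        port_queue.put(udp_port)  # hand the port back for reuse

threads = []
for folder_name, inner_dict in folder_file_dict.items():
    for file_names_array in inner_dict.values():
        t = threading.Thread(target=worker, args=(file_names_array,))
        t.start()
        threads.append(t)
for t in threads:
    t.join()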

3 answers:

Answer 0 (score: 1)

Use a thread pool:

#!/usr/bin/env python2
from multiprocessing.dummy import Pool, Queue # thread pool

folder_file_dict = {
    folder_name: {
        file_names_key: file_names_array
    }
}

def process_files_in_series(file_names_array, udp_port):
    for file_name in file_names_array:
         time_consuming_method(file_name, udp_port)
         # create "file_name"
         ...

def mp_process(filenames):
    udp_port = free_udp_ports.get() # block until a free udp port is available
    args = filenames, udp_port
    try:
        return args, process_files_in_series(*args), None
    except Exception as e:
        return args, None, str(e)
    finally:
        free_udp_ports.put_nowait(udp_port)

free_udp_ports = Queue() # in general, use initializer to pass it to children
for port in udp_ports:
    free_udp_ports.put_nowait(port)
pool = Pool(number_of_concurrent_jobs)
for args, result, error in pool.imap_unordered(mp_process, get_files_arrays()):
    if error is not None:
        print args, error

If the processing time may differ for different filename arrays, I don't think you need to tie the number of threads to the number of udp ports.

If I understand the structure of folder_file_dict correctly, then to generate the filename arrays:

def get_files_arrays(folder_file_dict=folder_file_dict):
    for folder_name_dict in folder_file_dict.itervalues():
        for filenames_array in folder_name_dict.itervalues():
            yield filenames_array
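
For the question's setup, the names left undefined above could be filled in roughly like this (the job count of 4 is an arbitrary example; if it is larger than len(udp_ports), the extra workers simply block in free_udp_ports.get() until a port is handed back):

udp_ports = [123, 456, 789]
number_of_concurrent_jobs = 4  # example value, not tied to len(udp_ports)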

Answer 1 (score: 0)

Use multiprocessing.pool.ThreadPool. It handles the queue/thread management for you, and can easily be changed to do multiprocessing.

EDIT: added an example

Here is an example... Multiple threads may end up using the same udp port. I'm not sure whether that is a problem for you.

import multiprocessing
import multiprocessing.pool
import itertools

def process_files_in_series(file_names_array, udp_port):
    for file_name in file_names_array:
         time_consuming_method(file_name, udp_port)
         # create "file_name"

udp_ports = [123, 456, 789]

folder_file_dict = {
         folder_name : {
                         file_names_key : [file_names_array]
                       }
        }

def process_files_star(args):
    # helper so pool.map can hand each (file_names_array, udp_port) pair
    # to process_files_in_series as two separate arguments
    file_names_array, udp_port = args
    process_files_in_series(file_names_array, udp_port)

def main(folder_file_dict, udp_ports):
    # number of threads - here limited to the smallest of the number of udp ports,
    # the number of file lists to process, and a cap arbitrarily set to 4
    num_threads = min(len(folder_file_dict), len(udp_ports), 4)
    # the pool
    pool = multiprocessing.pool.ThreadPool(num_threads)
    # build the file lists to be processed. You may want to do other
    # things here, like joining folder_name...
    file_arrays = [value['file_names_key'] for value in folder_file_dict.values()]
    # do the work
    pool.map(process_files_star, zip(file_arrays, itertools.cycle(udp_ports)))
    pool.close()
    pool.join()
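
As noted at the top of this answer, switching from threads to processes is then essentially a one-line change; a sketch, assuming the rest of main() stays the same (process_files_star must remain a module-level function so it can be pickled, and keep in mind the question itself rules out processes):

    # processes instead of threads; map/close/join work the same way
    pool = multiprocessing.Pool(num_threads)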

Answer 2 (score: 0)

Here is a blueprint of how to use multiprocessing.Process, handing the work off to workers through a JoinableQueue. You will still be I/O bound, but with Process you get true concurrency, which may prove useful, since threading may even be slower than an ordinary script that processes the files one after another.

(Note that this will also keep you from doing anything else on your laptop if you dare to start too many processes at once :P).

I have tried to explain the code with as many comments as possible.

import traceback

from multiprocessing import Process, JoinableQueue, cpu_count

# Number of CPUs on your PC
cpus = cpu_count()

# The Worker Function. Could also be modelled as a class
def Worker(q_jobs):
    while True:
        # A try / except / finally block may be necessary for error-prone tasks, since the
        # processes may hang forever if the task_done() method is never called.
        try:
            # Get an item from the Queue
            item = q_jobs.get()

            # At this point the data should somehow be processed

        except:
            traceback.print_exc()
        else:
            pass

        finally:
            # Inform the Queue that the Task has been done
            # Without this, the processes can not be killed
            # and will be left as zombies afterwards
            q_jobs.task_done()


# A Joinable Queue to end the process
q_jobs = JoinableQueue()

# Create processes depending on the number of CPUs
for i in range(cpus):

    # target function and arguments;
    # note that a single-argument tuple needs a trailing comma, e.g. (q_jobs,),
    # while a multi-argument tuple such as (q_jobs, 'bla') does not
    p = Process(target=Worker,
                args=(q_jobs,)
                )
    p.daemon = True
    p.start()

# fill Queue with Jobs
q_jobs.put(['Do'])
q_jobs.put(['Something'])

# End Process
q_jobs.join()
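
Mapped onto the original question, one way to adapt this blueprint is to start one worker per UDP port, so that a port is only ever used by one job at a time, and to feed the file-name arrays through the queue. A rough sketch (still process-based, which the question rules out, and reusing the names defined in the question):

from multiprocessing import Process, JoinableQueue

def PortWorker(q_jobs, udp_port):
    # each worker owns exactly one UDP port
    while True:
        file_names_array = q_jobs.get()
        try:
            process_files_in_series(file_names_array, udp_port)
        finally:
            q_jobs.task_done()

q_jobs = JoinableQueue()

# one process per available port
for port in udp_ports:
    p = Process(target=PortWorker, args=(q_jobs, port))
    p.daemon = True
    p.start()

# one job per file-name array
for folder_name_dict in folder_file_dict.itervalues():
    for file_names_array in folder_name_dict.itervalues():
        q_jobs.put(file_names_array)

q_jobs.join()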

Cheers

EDIT

I wrote this with Python 3 in mind. Taking the parentheses off the print function, i.e.

print item

should make this work for Python 2.7.