Parallelizing Python code

Date: 2015-08-19 08:54:44

Tags: python pandas python-multithreading python-multiprocessing

I have written a function that takes a list of peptides (biological sequences as strings) as input and returns a Pandas DataFrame (samples as rows, descriptors as columns). Calling my_function(pep_list) takes pep_list as its argument, iterates over each peptide sequence in pep_list, computes the descriptors, combines all the data into a pandas DataFrame, and returns the df:

Example:

    pep_list = ["DAAAAEF", "DAAAREF", "DAAANEF", "DAAADEF", "DAAACEF", "DAAAEEF",
                "DAAAQEF", "DAAAGEF", "DAAAHEF", "DAAAIEF", "DAAALEF", "DAAAKEF"]
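For illustration, a minimal sketch of what such a function might look like; the two descriptors here (sequence length and alanine count) are hypothetical placeholders for the real ones:

import pandas as pd

def my_function(pep_list):
    rows = []
    for pep in pep_list:
        # hypothetical descriptors; the real code computes domain-specific ones
        rows.append({"peptide": pep,
                     "length": len(pep),
                     "n_ala": pep.count("A")})
    # samples as rows, descriptors as columns
    return pd.DataFrame(rows).set_index("peptide")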

I would like to parallelize this code with the algorithm given below:

1. Get the number of available processors:

    n = multiprocessing.cpu_count()

2. Split pep_list into n sub-lists (see the sketch after this list):

    sub_list_of_pep_list = pep_list / n

    sub_list_of_pep_list = [["DAAAAEF", "DAAAREF", "DAAANEF"],
                            ["DAAADEF", "DAAACEF", "DAAAEEF"],
                            ["DAAAQEF", "DAAAGEF", "DAAAHEF"],
                            ["DAAAIEF", "DAAALEF", "DAAAKEF"]]

3. Run "my_function()" on each core (example with 4 cores):

    df0 = my_function(sub_list_of_pep_list[0])
    df1 = my_function(sub_list_of_pep_list[1])
    df2 = my_function(sub_list_of_pep_list[2])
    df3 = my_function(sub_list_of_pep_list[3])

4. Join all the results: df = concat([df0, df1, df2, df3])

5. Return df with an ~n× speedup.
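Step 2 above is pseudocode; a minimal sketch of splitting a list into n roughly equal sub-lists (split_into_chunks is a hypothetical helper, not part of the original code):

import multiprocessing

def split_into_chunks(items, n):
    # split items into n sub-lists whose sizes differ by at most one
    k, r = divmod(len(items), n)
    chunks = []
    start = 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

n = multiprocessing.cpu_count()
sub_list_of_pep_list = split_into_chunks(pep_list, n)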

Please suggest the most suitable library for implementing this approach.

Thanks and regards.

Update:

After some reading, I was able to write code that does what I expected:

1. Without parallelization, 10 peptide sequences take ~10 seconds.
2. With two processes, 12 peptides take ~6 seconds.
3. With four processes, 12 peptides take ~4 seconds.

from multiprocessing import Process

def func1():
    structure_gen(pep_seq=["DAAAAEF", "DAAAREF", "DAAANEF"])

def func2():
    structure_gen(pep_seq=["DAAAQEF", "DAAAGEF", "DAAAHEF"])

def func3():
    structure_gen(pep_seq=["DAAADEF", "DAAALEF"])

def func4():
    structure_gen(pep_seq=["DAAAIEF", "DAAALEF"])

if __name__ == '__main__':
    # start one process per sub-list
    p1 = Process(target=func1)
    p1.start()
    p2 = Process(target=func2)
    p2.start()
    p3 = Process(target=func3)
    p3.start()
    p4 = Process(target=func4)
    p4.start()
    # wait for all four processes to finish
    p1.join()
    p2.join()
    p3.join()
    p4.join()

But while this code works fine for 10 peptides, it is not feasible when pep_list contains 1 million peptides.
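One way to remove the hand-written wrapper functions is to build the Process objects in a loop; a sketch assuming structure_gen is defined as above. Note that this still starts one process per chunk with no cap on concurrency, and it does not collect return values; the Pool in the answer below addresses both points:

from multiprocessing import Process

chunks = [["DAAAAEF", "DAAAREF", "DAAANEF"],
          ["DAAAQEF", "DAAAGEF", "DAAAHEF"],
          ["DAAADEF", "DAAACEF", "DAAAEEF"],
          ["DAAAIEF", "DAAALEF", "DAAAKEF"]]

if __name__ == '__main__':
    # one process per chunk
    processes = [Process(target=structure_gen, kwargs={"pep_seq": chunk})
                 for chunk in chunks]
    for p in processes:
        p.start()
    for p in processes:
        p.join()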

Thanks.

1 Answer:

Answer 0 (score: 3)

multiprocessing.Pool.map is exactly what you are looking for. Try this:

import numpy as np
from multiprocessing import Pool
from pandas import concat

# I recommend using more partitions than processes;
# this way the work can be balanced.
# Of course this only makes sense if pep_list is bigger than
# the one you provide. If not, change this to 8 or so.
n = 50

# create indices for the partitions
ix = np.linspace(0, len(pep_list), n + 1, endpoint=True, dtype=int)

# create partitions using the indices
sub_lists = [pep_list[i1:i2] for i1, i2 in zip(ix[:-1], ix[1:])]

p = Pool()
try:
    # p.map returns a list of dataframes, which are then
    # concatenated into one
    df = concat(p.map(my_function, sub_lists))
finally:
    p.close()

The Pool will automatically contain as many processes as there are available cores, but you can override that number if you want; just check the documentation.
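For example, a minimal sketch of overriding the worker count (Pool takes the number of processes as its first argument):

from multiprocessing import Pool

p = Pool(processes=4)  # use exactly 4 worker processes instead of one per core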