Parallelizing a for loop in Python

Posted: 2018-08-12 04:07:15

Tags: python python-3.x multiprocessing

I have a dictionary in which each key (a date) holds a table (several lists in the format [day1, val11, val21], [day2, val12, val22], [day3, val13, val23], ...). I want to convert it into a DataFrame, which can be done with the following code:

df4 = pd.DataFrame(columns=sorted(set_days))

for date in dic.keys():
    days = [day for day, val1, val2 in dic[date]]
    val1 = [val1 for day, val1, val2 in dic[date]]
    df4.loc[date, days] = val1

This code works fine, but it takes more than two hours to run. After some research, I realized I could parallelize it with the multiprocessing library. The following code is the intended parallel version:

import multiprocessing

def func(date):
    global df4, dic
    days = [day  for day, val1, val2  in dic[date]]
    val1 = [val1 for day, val1, val2  in dic[date]]
    df4.loc[date, days] = val1

multiprocessing.Pool(processes=8).map(func, dic.keys())

The problem with this code is that, after executing multiprocessing.Pool(processes..., the df4 DataFrame is empty.

Any help would be appreciated.

Example

Suppose the dictionary contains two dates:

dic['20030812'][:4]
Out: [[1, 24.25, 0.0], [20, 23.54, 23.54], [30, 23.13, 24.36], [50, 22.85, 23.57]]

dic['20030813'][:4]
Out: [[1, 24.23, 0.0], [19, 23.4, 22.82], [30, 22.97, 24.19], [49, 22.74, 23.25]]

Then the DataFrame should look like:

df4.loc[:, 1:50]
             1    2    3    4    5   ...     46   47   48     49     50
20030812  24.25  NaN  NaN  NaN  NaN  ...    NaN  NaN  NaN    NaN  22.85
20030813  24.23  NaN  NaN  NaN  NaN  ...    NaN  NaN  NaN  22.74    NaN

dic.keys()
Out[36]: dict_keys(['20030812', '20030813'])

df1.head().to_dict()
Out: 
{1: {'20030812': 24.25, '20030813': 24.23},
 2: {'20030812': nan, '20030813': nan},
 3: {'20030812': nan, '20030813': nan},
 4: {'20030812': nan, '20030813': nan},
 5: {'20030812': nan, '20030813': nan},
 6: {'20030812': nan, '20030813': nan},
 7: {'20030812': nan, '20030813': nan},
 8: {'20030812': nan, '20030813': nan},
 9: {'20030812': nan, '20030813': nan},
 10: {'20030812': nan, '20030813': nan},
 11: {'20030812': nan, '20030813': nan},
 12: {'20030812': nan, '20030813': nan},
 13: {'20030812': nan, '20030813': nan},
 14: {'20030812': nan, '20030813': nan},
 15: {'20030812': nan, '20030813': nan},
 16: {'20030812': nan, '20030813': nan},
 17: {'20030812': nan, '20030813': nan},
 18: {'20030812': nan, '20030813': nan},
 19: {'20030812': nan, '20030813': 23.4},
 20: {'20030812': 23.54, '20030813': nan},
 21: {'20030812': nan, '20030813': nan},
 22: {'20030812': nan, '20030813': nan},
 23: {'20030812': nan, '20030813': nan},
 24: {'20030812': nan, '20030813': nan},
 25: {'20030812': nan, '20030813': nan},
 26: {'20030812': nan, '20030813': nan},
 27: {'20030812': nan, '20030813': nan},
 28: {'20030812': nan, '20030813': nan},
 29: {'20030812': nan, '20030813': nan},
 30: {'20030812': 23.13, '20030813': 22.97},
 31: {'20030812': nan, '20030813': nan},
 32: {'20030812': nan, '20030813': nan},
 ...

2 Answers:

Answer 0 (score: 1)

To answer your original question (roughly, "why is the df4 DataFrame empty?"): the reason this doesn't work is that when the Pool workers are launched, each one inherits a personal copy-on-write view of the parent's data (either directly, when multiprocessing runs on a UNIX-like system with fork, or via a somewhat kludgy simulation when it runs on Windows).

So when each worker does:

 df4.loc[date, days] = val1

it is changing the worker's personal copy of df4; the parent process's copy is left untouched.
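
As a minimal, self-contained sketch of this copy-on-write behavior (not part of the original answer; the shared dict and mutate function are hypothetical names used purely for illustration), each worker mutates its own copy while the parent's copy keeps its initial value:

import multiprocessing

# Hypothetical example: `shared` and `mutate` are illustrative names only.
shared = {'count': 0}

def mutate(i):
    # This changes the worker's private copy of `shared`, not the parent's.
    shared['count'] += i
    return shared['count']

if __name__ == '__main__':
    with multiprocessing.Pool(processes=2) as pool:
        pool.map(mutate, range(5))
    print(shared['count'])  # prints 0 in the parent, despite the workers' mutations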

In general, there are three ways to handle this:

  1. Change your worker function to return something the parent process can use. For example, instead of attempting the in-place mutation df4.loc[date, days] = val1, return whatever is needed to do it in the parent, e.g. return date, days, val1, then change the parent to:

    for date, days, val in multiprocessing.Pool(processes=8).map(func, dic.keys()):
        df4.loc[date, days] = val
    

    The downside of this approach is that it requires every return value to be pickled (Python's form of serialization), piped from child to parent, and unpickled. If the worker task doesn't do much work, and especially if the return values are large (which seems to be the case here), it can easily spend more time on serialization and IPC than it gains from parallelism. (A complete sketch of this return-based pattern is shown after this list.)

  2. Use shared objects/memory (demonstrated in this answer to "Multiprocessing writing to pandas dataframe"). In practice this usually doesn't gain you much, because anything not based on the more "raw" ctypes sharing via multiprocessing.sharedctypes still ends up needing to pipe the data from one process to another; sharedctypes-based approaches, however, can give a significant speed boost, since once mapped, a shared raw C array is nearly as fast to access as local memory.

  3. If the work being parallelized is I/O bound, or uses third-party C extensions for CPU-bound work (e.g. numpy), you may be able to get the needed speedup from threads instead, despite GIL interference, since threads do share the same memory. Your case doesn't appear to be I/O bound, nor does it meaningfully rely on third-party C extensions that might release the GIL, so it probably won't help here; but in general, the simple way to switch from process-based to thread-based parallelism (when you're already using multiprocessing) is to change the import from:

    import multiprocessing

    to:

    import multiprocessing.dummy as multiprocessing
    

    which imports the thread-backed version of multiprocessing under the expected name, so the code switches seamlessly from using processes to using threads.
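
Putting option 1 together, here is a rough sketch of how the question's func could be restructured to return its results instead of mutating df4 in the workers. The tiny dic / set_days defined here are stand-ins taken from the sample data above, just so the sketch runs on its own; they are not meant to reproduce the real data.

import multiprocessing
import pandas as pd

# Toy stand-ins for the question's dic / set_days, built from the sample shown
# earlier, so this sketch is runnable on its own; the real objects are much larger.
dic = {
    '20030812': [[1, 24.25, 0.0], [20, 23.54, 23.54]],
    '20030813': [[1, 24.23, 0.0], [19, 23.4, 22.82]],
}
set_days = {1, 19, 20}

def func(date):
    # Workers only read dic and return what the parent needs; no mutation here.
    days = [day for day, val1, val2 in dic[date]]
    vals = [val1 for day, val1, val2 in dic[date]]
    return date, days, vals

if __name__ == '__main__':
    df4 = pd.DataFrame(columns=sorted(set_days))
    with multiprocessing.Pool(processes=8) as pool:
        for date, days, vals in pool.map(func, dic.keys()):
            df4.loc[date, days] = vals  # all mutation happens in the parent
    print(df4)

Whether this actually beats the serial loop depends on how much pickling the return values costs relative to the work done in each task, as the caveat above points out.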

Answer 1 (score: 0)

As RafaelC hinted, this is an XY problem. I was able to reduce the execution time to 20 seconds without multiprocessing.

I created a list, lista, to replace the dictionary, and instead of adding each date to the df4 DataFrame one at a time, I convert lista into a DataFrame once it has been filled.

import numpy as np
import pandas as pd

# Returns the largest day across all the dates (each date has a different number of days)
def longest_series(dic):
    largest_series = 0
    for date in dic.keys():
        # get the last day's table of a specific date
        current_series = dic[date][-1][0]
        if largest_series < current_series:
            largest_series = current_series
    return largest_series


ls = longest_series(dic)
l_total_days = list(range(1, ls+1))
s_total_days = set(l_total_days)

# Create the lista list; lista is similar to dic.
# The difference is that, in lista, every date has the same number of days,
# i.e. from 1 to ls, and it does not contain the dates.

# It takes 15 seconds
lista = list()
for date in dic.keys():
    present_days = list()
    present_values = list()
    for day, val_252, _ in dic[date]:
        present_days.append(day)
        present_values.append(val_252)

    missing_days = list(s_total_days.difference(set(present_days))) # extra days added to this date
    missing_values = [None] * len(missing_days)                     # extra values added to this date
    all_days_index = list(np.argsort(present_days + missing_days))  # preserve the pairing between days and values
    all_day_values = present_values + missing_values
    lista.append(list(np.array(all_day_values)[all_days_index]))


# It takes 4 seconds
df = pd.DataFrame(lista, index=dic.keys(), columns=l_total_days)
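
If the full dict matches the sample shown earlier, a quick sanity check of the result might look like the following (the expected values are taken from the question's example; missing days come out as None):

print(df.loc['20030812', 1])    # expected: 24.25
print(df.loc['20030813', 49])   # expected: 22.74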