Dask: runtime longer than pandas

Date: 2018-06-23 04:43:38

Tags: python pandas dask

Edited to create a self-contained example.

I'm having some trouble getting Dask to speed up my code. I have a 9.9-million-row by 18-column dataframe that contains 2.75 million groups keyed on its tour_id column. I want to run a custom function in parallel on these groupby objects (which behave like dataframes), so Dask seemed like a good fit.

The custom function generates two values (a start time and an arrival time) for every row in a group. Except for the first row of each group, these values depend on the values established for the previous row and on a predefined sampling distribution, so iterating through a group row by row seems to be the only option. There may be no way to do this more efficiently, with or without Dask; I'm fairly convinced of that.

Now to the problem at hand. I'm showing trimmed-down versions of all the dataframes to reproduce the situation.

First, the main dataframe, called temp_df1. The final output from Dask will be joined back to this data, and it also serves as the input to the function that Dask calls.

import pandas as pd
temp_df1 = pd.DataFrame({'trip_id':[22186702,22186703,22186704,26777219,26777220,26777221,26777222,26777223],
            'tour_id':[13525325,13525325,13525325,13525328,13525328,13525328,13525328,13525328],
            'start_time':[8,0,0,10.92,0,0,0,0],
            'ttime_mins':[3.810553,4.649286,2.917499,5.415158,3.800613,1.829472,1.829472,8.643289],
            'arrival_time':[8.063509,0,0,11.010253,0,0,0,0],
            'weight_column':['HBO_outbound','HBO_outbound','HBO_inbound','HBO_outbound','HBO_outbound','NHB_outbound','NHB_inbound','HBM_inbound']})
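For reference, the group sizes in this trimmed example can be checked quickly:

print(temp_df1.groupby('tour_id').size())
### tour_id
### 13525325    3
### 13525328    5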

Second, the time-sampling dataframe. It stores all the observed weights, stratified by Time (hour of the day) and weight segment (HBO_inbound, HBM_inbound, etc.). The function (fun4 or fun6) samples an hour from this dataframe for every row except the first row of each group, subsetting by the appropriate segment column.

time_dist = pd.DataFrame({'Time':[8,9,10,11,12,13,14],
            'HBO_outbound':[1573,419,339,544,600,453,100],
            'HBO_inbound':[1573,419,339,544,100,953,800],
            'HBM_outbound':[1573,419,339,544,640,463,90],
            'HBM_inbound':[1573,419,339,544,320,453,100],
            'WBO_outbound':[1573,419,339,544,600,453,100],
            'WBO_inbound':[1573,419,339,544,450,803,190],
            'NHB_outbound':[1573,419,339,544,901,543,290],
            'NHB_inbound':[1573,419,339,544,863,453,330]})
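To illustrate the sampling step just described, here is a minimal sketch (the previous arrival time of 9.08 is just an example value): the distributions are sliced to the hours at or after the previous arrival time, and one hour is drawn weighted by the relevant segment column.

prev_arrival = 9.08 ### example: arrival time established by the previous row
eligible = time_dist.loc[time_dist['Time'] >= prev_arrival]
drawn = eligible.sample(n=1, weights=eligible['HBO_outbound'], random_state=124)
print(drawn['Time'].iloc[0]) ### one of the hours 10..14, weighted by HBO_outbound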

Now the functions. I use fun4 when running the groupby-and-apply combination in Dask.

def fun4(df):
    """
    Args: dataframe (temp_df1)
    """
    #### loop through the dataframe supplied
    for i in range(0, df.shape[0]):
        if i == 0:
            start_time = df['start_time'].iloc[i]  ### get the start_time of the first row that is precomputed
            arrival_time = df['arrival_time'].iloc[i]  ### get the arrival_time of the first row that is precomputed
            tour_id = df['tour_id'].iloc[i] ### get the name of the tour being solved
            results_frow.append(start_time)
            results_frow.append(arrival_time)
            results_frow.append(tour_id)
        else:
            tour_id = df['tour_id'].iloc[i] ### get the name of the tour being solved
            arrival_time_prev = results_frow[-2] ### get the arrival time of the previous row as this serves as a constraint
            time_dist1 = time_dist.loc[time_dist['Time'] >= arrival_time_prev] ### slice the time distributions before sampling
            weight_column = df['weight_column'].iloc[i] ### get weight column to sample from

            #### sample a time and calculate a new arrival time as a result
            if len(time_dist1) > 0:
                start_time = time_dist1.sample(n=1, weights=time_dist1[weight_column], replace=True, random_state=prng)
                start_time = start_time[['Time']].values ### extract the sampled hour
                start_time = start_time[0][0]
            else:
                start_time = results_frow[-2]

            newarrival_time = start_time + df['ttime_mins'].iloc[i]/60 ### calculate the arrival time by adding the travel time to the start time
            results_frow.append(start_time)
            results_frow.append(newarrival_time)
            results_frow.append(tour_id)

    return  (pd.DataFrame({'start_time': results_frow[0::3],
                              'arrival_time': results_frow[1::3],
                              'tour_id': results_frow[2::3]}))
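Note that fun4 reads from and appends to the module-level results_frow list (defined in the run snippets below) and never resets it, so each call returns every row accumulated so far. A quick pandas-only demonstration of that behaviour:

results_frow = []
prng = 124
grp = temp_df1[temp_df1['tour_id'] == 13525325]
print(len(fun4(grp))) ### 3 rows
print(len(fun4(grp))) ### 6 rows -- the global list keeps growing between calls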

When using map_partitions, I use fun6. The logic is identical apart from the two lines of groupby code that I believe are needed for map_partitions to work properly.

def fun6(in_df):
    """
    Args: dataframe (temp_df1)

    """
    results_frow1 = []
    for name, df in in_df.groupby('tour_id'):
        results_frow = []

        for i in range(0, df.shape[0]):

            if i == 0:
                start_time = df['start_time'].iloc[i]  ### get the start_time of the first row that is precomputed
                arrival_time = df['arrival_time'].iloc[i]  ### get the arrival_time of the first row that is precomputed
                tour_id = df['tour_id'].iloc[i] ### get the name of the tour being solved
                results_frow.append(start_time)
                results_frow.append(arrival_time)
                results_frow.append(tour_id)

            else:
                tour_id = df['tour_id'].iloc[i] ### get the name of the tour being solved
                arrival_time_prev = results_frow[-2] ### get the arrival time of the previous row as this serves as a constraint
                time_dist1 = time_dist.loc[time_dist['Time'] >= arrival_time_prev] ### slice the time distributions before sampling
                weight_column = df['weight_column'].iloc[i] ### get weight column to sample from

                # sample a time and calculate a new arrival time as a result
                if len(time_dist1) > 0:
                    start_time = time_dist1.sample(n=1, weights=time_dist1[weight_column], replace=True, random_state=prng)
                    start_time = start_time[['Time']].values ### extract the sampled hour
                    start_time = start_time[0][0]
                else:
                    start_time = results_frow[-2]

                newarrival_time = start_time + df['ttime_mins'].iloc[i]/60 ### calculate the arrival time by adding the travel time to the start time
                results_frow.append(start_time)
                results_frow.append(newarrival_time)
                results_frow.append(tour_id)

        results_frow1.extend(results_frow)

    return  (pd.DataFrame({'start_time': results_frow1[0::3],
                          'arrival_time': results_frow1[1::3],
                          'tour_id': results_frow1[2::3]}))
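Unlike fun4, fun6 builds a fresh results_frow list for each group inside the function, so it carries no hidden state between calls.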

Now, running and testing.

Option 1: using groupby and apply.

import dask.dataframe as dd
results_frow = []
prng = 124

ddf = dd.from_pandas(temp_df1, npartitions=1)
gpb = ddf.groupby('tour_id').apply(fun4, meta=pd.DataFrame(dtype='float64', columns=['start_time', 'arrival_time', 'tour_id'])).compute()

Output: although the actual start_time and arrival_time values are correct, groupby-and-apply returns more records than the input has. For tour 13525325 there should be only 3 records, and for 13525328 only 5 rows, yet as the output shows there are many duplicates of 13525325.

tour_id     start_time  arrival_time    tour_id

    13525325 8.000000   8.063509    13525325
             9.000000   9.077488    13525325
             10.000000  10.048625   13525325
             8.000000   8.063509    13525325
             9.000000   9.077488    13525325
             10.000000  10.048625   13525325
    13525328 8.000000   8.063509    13525325
             9.000000   9.077488    13525325
             10.000000  10.048625   13525325
             8.000000   8.063509    13525325
             9.000000   9.077488    13525325
             10.000000  10.048625   13525325
             10.920000  11.010253   13525328
             12.000000  12.063344   13525328
             13.000000  13.030491   13525328
             14.000000  14.030491   13525328
             14.030491  14.174546   13525328 
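One way to narrow down where the duplicates come from is to count how often Dask actually invokes the function; a debugging sketch (the wrapper fun4_counted is made up here):

import itertools
calls = itertools.count(1)

def fun4_counted(df):
    print('call', next(calls), '- group', df['tour_id'].iloc[0], '-', len(df), 'rows')
    return fun4(df)

results_frow = []
gpb = ddf.groupby('tour_id').apply(fun4_counted, meta=pd.DataFrame(dtype='float64', columns=['start_time', 'arrival_time', 'tour_id'])).compute()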

Option 2: using map_partitions.

import dask.dataframe as dd
prng = 124
results_frow = []

temp_df1['hash'] = temp_df1['tour_id'] ### set a hash to serve as the index; my understanding is that this is needed to split groups cleanly across partitions
temp_df1 = temp_df1.set_index('hash')
ddf = dd.from_pandas(temp_df1, npartitions=1)

gpb = ddf.map_partitions(fun6, meta=pd.DataFrame(dtype='float64', columns=['start_time', 'arrival_time', 'tour_id'])).compute()

Output: with map_partitions, both the values and the number of records are correct.

    start_time  arrival_time  tour_id
0   8.000000    8.063509      13525325
1   9.000000    9.077488      13525325
2   10.000000   10.048625     13525325
3   10.920000   11.010253     13525328
4   12.000000   12.063344     13525328
5   13.000000   13.030491     13525328
6   14.000000   14.030491     13525328
7   14.030491   14.174546     13525328
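For the two-partition timings below, the same call should work once the hash index lets from_pandas split the frame along group boundaries. A sketch, assuming a sorted index so that rows sharing an index value stay within one partition:

temp_df1 = temp_df1.sort_index() ### divisions are derived from a sorted index
ddf2 = dd.from_pandas(temp_df1, npartitions=2)
print(ddf2.divisions) ### partition boundaries, e.g. (13525325, 13525328, 13525328)
gpb2 = ddf2.map_partitions(fun6, meta=pd.DataFrame(dtype='float64', columns=['start_time', 'arrival_time', 'tour_id'])).compute()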

The timing results are below. I've also added the pandas version.

Rows    Groups  Partitions  Dask-groupby  Dask-partitions  Pandas
799     218     1           2.1s          1.59s            1.59s
799     218     2           2.42s         1.8s             -
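For comparison, the Pandas column was presumably produced by calling fun6 directly on the pandas dataframe, since the function already loops over the groups itself:

prng = 124
pandas_result = fun6(temp_df1)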

Questions

  • Given this problem, i.e. wanting to run a custom function in parallel over 2.75M groups, which should I use: groupby/apply or map_partitions?
  • Why does groupby create all the extra rows?
  • How can I significantly improve the runtime with either groupby or map_partitions?
