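Edited to create a standalone example
I am having trouble getting Dask to speed up my code. I have a dataframe of 9.9 million rows x 18 columns, which contains 2.75 million groups when grouped by the tour_id column. I want to run a custom function on these groupby objects (which behave like dataframes) in parallel, so Dask seemed like a good choice.
The custom function generates two values (start time and arrival time) for every row in a group. Except for the first row of each group, these two values depend on the values established in the previous row and on a predefined sampling distribution. Looping through each group therefore seems to be the only option. There may be a more efficient way to do this without Dask, and I am open to that.
Now to the problem at hand. I am showing trimmed-down versions of all the dataframes to reproduce the situation.
First, the main dataframe, called temp_df1. The final output from Dask will be joined back to this data, and it also serves as the input to the functions that Dask calls.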
import pandas as pd

temp_df1 = pd.DataFrame({'trip_id': [22186702, 22186703, 22186704, 26777219, 26777220, 26777221, 26777222, 26777223],
                         'tour_id': [13525325, 13525325, 13525325, 13525328, 13525328, 13525328, 13525328, 13525328],
                         'start_time': [8, 0, 0, 10.92, 0, 0, 0, 0],
                         'ttime_mins': [3.810553, 4.649286, 2.917499, 5.415158, 3.800613, 1.829472, 1.829472, 8.643289],
                         'arrival_time': [8.063509, 0, 0, 11.010253, 0, 0, 0, 0],
                         'weight_column': ['HBO_outbound', 'HBO_outbound', 'HBO_inbound', 'HBO_outbound', 'HBO_outbound', 'NHB_outbound', 'NHB_inbound', 'HBM_inbound']})
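As a quick sanity check on the sample above, the number of rows per tour can be counted directly. This is only a sketch: it rebuilds the tour_id column from the values shown rather than reusing temp_df1, and the name expected_rows is introduced here for illustration.

```python
import pandas as pd

# tour_id values copied from temp_df1 above: three trips on one tour, five on the other.
tour_ids = pd.Series([13525325] * 3 + [13525328] * 5, name='tour_id')

# Number of input rows per tour; a per-group function that emits one
# output row per input row should return the same counts.
expected_rows = tour_ids.value_counts().sort_index()
print(expected_rows)
```

Any per-group result can be compared against these counts to spot extra or missing rows.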
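Second, the time sampling dataframe, time_dist. This dataframe stores all observed weights, stratified by time (hour of the day) and weight segment (HBO_inbound, HBM_inbound, etc.). For every row except the first row of each group, the function (fun4 or fun6) samples an hour from this dataframe, using the column that matches the row's segment as the weights.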
time_dist = pd.DataFrame({'Time': [8, 9, 10, 11, 12, 13, 14],
                          'HBO_outbound': [1573, 419, 339, 544, 600, 453, 100],
                          'HBO_inbound': [1573, 419, 339, 544, 100, 953, 800],
                          'HBM_outbound': [1573, 419, 339, 544, 640, 463, 90],
                          'HBM_inbound': [1573, 419, 339, 544, 320, 453, 100],
                          'WBO_outbound': [1573, 419, 339, 544, 600, 453, 100],
                          'WBO_inbound': [1573, 419, 339, 544, 450, 803, 190],
                          'NHB_outbound': [1573, 419, 339, 544, 901, 543, 290],
                          'NHB_inbound': [1573, 419, 339, 544, 863, 453, 330]})
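The sampling step for a single row can be sketched on its own. This is a minimal illustration, assuming only one weight column is needed; the helper name sample_start_hour and its seed parameter are hypothetical and are not part of the functions in this post.

```python
import pandas as pd

# Trimmed copy of time_dist with one weight column (values from the post).
time_dist = pd.DataFrame({'Time': [8, 9, 10, 11, 12, 13, 14],
                          'HBO_outbound': [1573, 419, 339, 544, 600, 453, 100]})

def sample_start_hour(arrival_time_prev, weight_column, seed):
    """Hypothetical helper: draw one hour at or after the previous arrival."""
    candidates = time_dist.loc[time_dist['Time'] >= arrival_time_prev]
    if len(candidates) == 0:
        # No hour left in the distribution: fall back to the previous arrival.
        return arrival_time_prev
    picked = candidates.sample(n=1, weights=candidates[weight_column],
                               random_state=seed)
    return picked['Time'].iloc[0]

# Second row of the first tour: previous arrival 8.063509, travel time 4.649286 minutes.
start = sample_start_hour(8.063509, 'HBO_outbound', seed=124)
arrival = start + 4.649286 / 60  # convert minutes to hours before adding
```

Passing the same integer as random_state re-seeds the generator on every call, so the draw is deterministic for a given set of candidate rows.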
Now the functions. I use fun4 when running the groupby and apply combination in Dask.
def fun4(df):
    """
    Args: dataframe (temp_df1)
    Note: appends to the module-level list results_frow and reads the globals time_dist and prng.
    """
    #### loop through the dataframe supplied
    for i in range(0, df.shape[0]):
        if i == 0:
            start_time = df['start_time'].iloc[i]  ### get the start_time of the first row that is precomputed
            arrival_time = df['arrival_time'].iloc[i]  ### get the arrival_time of the first row that is precomputed
            tour_id = df['tour_id'].iloc[i]  ### get the name of the tour being solved
            results_frow.append(start_time)
            results_frow.append(arrival_time)
            results_frow.append(tour_id)
        else:
            tour_id = df['tour_id'].iloc[i]  ### get the name of the tour being solved
            arrival_time_prev = results_frow[-2]  ### get the arrival time of the previous row as this serves as a constraint
            time_dist1 = time_dist.loc[time_dist['Time'] >= arrival_time_prev]  ### slice the time distributions before sampling
            weight_column = df['weight_column'].iloc[i]  ### get weight column to sample from
            #### sample a time and calculate a new arrival time as a result
            if len(time_dist1) > 0:
                start_time = time_dist1.sample(n=1, weights=time_dist1[weight_column], replace=True, random_state=prng)
                start_time = start_time[['Time']].values  ### extract the sampled hour as a 2-D array
                start_time = start_time[0][0]
            else:
                start_time = results_frow[-2]
            newarrival_time = start_time + df['ttime_mins'].iloc[i]/60  ### calculate the arrival time by adding start time to the travel time
            results_frow.append(start_time)
            results_frow.append(newarrival_time)
            results_frow.append(tour_id)
    return (pd.DataFrame({'start_time': results_frow[0::3],
                          'arrival_time': results_frow[1::3],
                          'tour_id': results_frow[2::3]}))
When using map_partitions, I use fun6. The logic is the same except for the two lines of groupby code that I believe are needed for map_partitions to work correctly.
def fun6(in_df):
    """
    Args: dataframe (temp_df1)
    """
    results_frow1 = []
    for name, df in in_df.groupby('tour_id'):
        results_frow = []
        for i in range(0, df.shape[0]):
            if i == 0:
                start_time = df['start_time'].iloc[i]  ### get the start_time of the first row that is precomputed
                arrival_time = df['arrival_time'].iloc[i]  ### get the arrival_time of the first row that is precomputed
                tour_id = df['tour_id'].iloc[i]  ### get the name of the tour being solved
                results_frow.append(start_time)
                results_frow.append(arrival_time)
                results_frow.append(tour_id)
            else:
                tour_id = df['tour_id'].iloc[i]  ### get the name of the tour being solved
                arrival_time_prev = results_frow[-2]  ### get the arrival time of the previous row as this serves as a constraint
                time_dist1 = time_dist.loc[time_dist['Time'] >= arrival_time_prev]  ### slice the time distributions before sampling
                weight_column = df['weight_column'].iloc[i]  ### get weight column to sample from
                # sample a time and calculate a new arrival time as a result
                if len(time_dist1) > 0:
                    start_time = time_dist1.sample(n=1, weights=time_dist1[weight_column], replace=True, random_state=prng)
                    start_time = start_time[['Time']].values  ### extract the sampled hour as a 2-D array
                    start_time = start_time[0][0]
                else:
                    start_time = results_frow[-2]
                newarrival_time = start_time + df['ttime_mins'].iloc[i]/60  ### calculate the arrival time by adding start time to the travel time
                results_frow.append(start_time)
                results_frow.append(newarrival_time)
                results_frow.append(tour_id)
        results_frow1.extend(results_frow)
    return (pd.DataFrame({'start_time': results_frow1[0::3],
                          'arrival_time': results_frow1[1::3],
                          'tour_id': results_frow1[2::3]}))
Now running and testing.
Option 1: Using groupby and apply.
import dask.dataframe as dd
results_frow = []
prng = 124
ddf = dd.from_pandas(temp_df1, npartitions=1)
gpb = ddf.groupby('tour_id').apply(fun4, meta = pd.DataFrame(dtype='float64', columns=['start_time', 'arrival_time', 'tour_id'])).compute()
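Output: Although the actual values of start_time and arrival_time are correct, groupby and apply returns more records than the input contains. There should be only 3 records for 13525325 and only 5 rows for 13525328. As the output shows, however, the rows for 13525325 are duplicated many times.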
tour_id   start_time  arrival_time   tour_id
13525325    8.000000      8.063509  13525325
            9.000000      9.077488  13525325
           10.000000     10.048625  13525325
            8.000000      8.063509  13525325
            9.000000      9.077488  13525325
           10.000000     10.048625  13525325
13525328    8.000000      8.063509  13525325
            9.000000      9.077488  13525325
           10.000000     10.048625  13525325
            8.000000      8.063509  13525325
            9.000000      9.077488  13525325
           10.000000     10.048625  13525325
           10.920000     11.010253  13525328
           12.000000     12.063344  13525328
           13.000000     13.030491  13525328
           14.000000     14.030491  13525328
           14.030491     14.174546  13525328
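One possible cause of the extra rows can be reproduced in plain pandas. The sketch below assumes the duplication comes from the applied function appending to a module-level list (as fun4 does with results_frow) rather than from Dask itself; this has not been verified against Dask internals. The names shared_rows, leaky and local are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'tour_id': [1, 1, 1, 2, 2],
                   'ttime_mins': [3.0, 4.0, 2.0, 5.0, 3.0]})

shared_rows = []  # module-level list, playing the role of results_frow

def leaky(group):
    # Appends to the shared list, so rows from earlier calls are still present.
    for minutes in group['ttime_mins']:
        shared_rows.append(minutes / 60)
    return pd.DataFrame({'hours': shared_rows})

def local(group):
    # Builds a fresh list on every call, so only this group's rows are returned.
    rows = [minutes / 60 for minutes in group['ttime_mins']]
    return pd.DataFrame({'hours': rows})

leaky_out = df.groupby('tour_id')[['ttime_mins']].apply(leaky)
local_out = df.groupby('tour_id')[['ttime_mins']].apply(local)
print(len(df), len(leaky_out), len(local_out))
```

In this sketch the version with the shared list returns more rows than the input, because each later group also carries the rows accumulated by earlier calls, while the version with a per-call list returns exactly one row per input row.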
Option 2: Using map_partitions
import dask.dataframe as dd
prng = 124
results_frow = []
temp_df1['hash'] = temp_df1['tour_id'] ### setting a hash to serve as an index for my understanding is that this will be needed to split groups into partitions
temp_df1 = temp_df1.set_index('hash')
ddf = dd.from_pandas(temp_df1, npartitions=1)
gpb = ddf.map_partitions(lambda df:fun6(df), meta = pd.DataFrame(dtype='float64', columns=['start_time', 'arrival_time', 'tour_id'])).compute()
Output: With map_partitions, both the values and the number of records are correct.
   start_time  arrival_time   tour_id
0    8.000000      8.063509  13525325
1    9.000000      9.077488  13525325
2   10.000000     10.048625  13525325
3   10.920000     11.010253  13525328
4   12.000000     12.063344  13525328
5   13.000000     13.030491  13525328
6   14.000000     14.030491  13525328
7   14.030491     14.174546  13525328
The timing results are below. I have also added a pandas version.
Rows  Groups  Partitions  Dask groupby-apply  Dask map_partitions  Pandas
799   218     1           2.1s                1.59s                1.59s
799   218     2           2.42s               1.8s                 -
Questions