I have a DataFrame like this:
CreatedDate | ID | Target
2018-07-03 19:10:19 id1 Available
2018-07-03 19:10:20 id1 Available
2018-07-03 19:12:33 id1 Available
2018-07-03 19:12:34 id1 Not Available
2018-07-03 19:15:24 id1 Available
2018-07-03 21:23:19 id2 Available
2018-07-03 21:23:20 id2 Not Available
2018-07-03 21:56:33 id2 Available
2018-07-03 22:01:34 id2 Not Available
2018-07-03 22:15:24 id2 Available
2018-07-03 22:16:24 id2 Available
2018-07-03 22:17:23 id2 Available
2018-07-03 22:17:24 id2 Available
2018-07-03 22:19:24 id2 Available
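The idea is to create, for each group (ID), a column holding the previous availability. The previous availability should be the Target value recorded close to the current CreatedDate minus 2 minutes.
In practice, the result should look like this: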
CreatedDate | ID | Target | Previous Availability
2018-07-03 19:10:19 id1 Available NaN
2018-07-03 19:10:20 id1 Available NaN
2018-07-03 19:12:33 id1 Available Available
2018-07-03 19:12:34 id1 Not Available Available
2018-07-03 19:15:24 id1 Available Not Available
2018-07-03 21:23:19 id2 Available NaN
2018-07-03 21:23:20 id2 Not Available NaN
2018-07-03 21:56:33 id2 Available Not Available
2018-07-03 22:01:34 id2 Not Available Available
2018-07-03 22:15:24 id2 Available Not Available
2018-07-03 22:16:24 id2 Available Not Available
2018-07-03 22:17:23 id2 Available Not Available
2018-07-03 22:17:24 id2 Available Not Available
2018-07-03 22:19:24 id2 Available Available
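Answer 0 (score: 0)
You could define a custom function, although it is not very efficient.
The main idea is that each row looks up the older availabilities of the same ID (at least two minutes older) and returns the last one.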
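import numpy as np
import pandas as pd

def check_previous(row):
    # Note: this reads `df` from the enclosing (global) scope.
    current_id = row.ID
    current_time = row.CreatedDate
    try:
        # Rows of the same ID that are more than two minutes older than this row.
        mask = (df.ID == current_id) & (df.CreatedDate < current_time - pd.Timedelta(minutes=2))
        # The last matching row is the most recent one if df is sorted by CreatedDate.
        return df.loc[mask, 'Target'].values[-1]
    except IndexError:
        # No older row exists for this ID.
        return np.nan

df['Previous Availability'] = df.apply(check_previous, axis=1)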
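Since check_previous depends on a global df, it can be handy to pass the frame in explicitly instead. Below is a minimal sketch of that variant, run on the id1 rows from the question. The names check_previous_in, frame, delay and sample are mine and purely illustrative; the logic is meant to be the same as above.

```python
import numpy as np
import pandas as pd

def check_previous_in(row, frame, delay=pd.Timedelta(minutes=2)):
    # Same lookup as check_previous, but against an explicitly passed frame
    # (hypothetical helper, not part of the original answer).
    mask = (frame["ID"] == row["ID"]) & (frame["CreatedDate"] < row["CreatedDate"] - delay)
    older = frame.loc[mask, "Target"]
    # Last matching row, or NaN when this ID has no row that is old enough.
    return older.iloc[-1] if len(older) else np.nan

# The id1 rows from the question.
sample = pd.DataFrame({
    "CreatedDate": pd.to_datetime([
        "2018-07-03 19:10:19", "2018-07-03 19:10:20", "2018-07-03 19:12:33",
        "2018-07-03 19:12:34", "2018-07-03 19:15:24",
    ]),
    "ID": ["id1"] * 5,
    "Target": ["Available", "Available", "Available", "Not Available", "Available"],
})

# Extra keyword arguments given to DataFrame.apply are forwarded to the function.
sample["Previous Availability"] = sample.apply(check_previous_in, axis=1, frame=sample)
print(sample)
```

On these five rows the first two get NaN and the rest get Available, Available, Not Available, which matches the expected output in the question.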
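Edit:
Actually, this code does not scale well in memory, since you have to build ever-larger masks and apply them to a large DataFrame.
Note that the computation time is almost linear: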
def create_and_apply(n_rows):
    # Random, increasing timestamps: each row is 0 to 299 seconds after the previous one.
    delays = np.cumsum(np.random.randint(300, size=n_rows))
    dates = pd.Timestamp('2018-07-03 19:10:19') + pd.to_timedelta(delays, unit='s')
    ids = np.random.choice(['id1', 'id2'], size=n_rows, replace=True)
    targets = np.random.choice(['Available', 'Not Available'], size=n_rows, replace=True)
    df = pd.DataFrame({'CreatedDate': dates, 'ID': ids, 'Target': targets})
    # Caveat: check_previous looks up the module-level `df`, not this local one.
    df['Previous Availability'] = df.apply(check_previous, axis=1)
    return df
%timeit create_and_apply(10)
12.4 ms ± 974 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit create_and_apply(100)
178 ms ± 49.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit create_and_apply(1000)
1.25 s ± 92.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit create_and_apply(10000)
11.1 s ± 573 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
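One way around this is to process the DataFrame in several parts, for example by splitting it by day (assuming you do not care about lookups that span midnight).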
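# Object dtype so that string values can be assigned without a dtype clash.
df['Previous Availability'] = pd.Series(np.nan, index=df.index, dtype=object)
# Use the calendar date rather than dt.day, so the same day of different months is not mixed.
df['day'] = df.CreatedDate.dt.normalize()
for current_id in df.ID.unique():
    for current_day in df.day.unique():
        mask = (df.ID == current_id) & (df.day == current_day)
        if not mask.any():
            continue  # this ID has no rows on this day
        df.loc[mask, 'Previous Availability'] = df.loc[mask].apply(check_previous, axis=1)
df.drop(columns='day', inplace=True)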
Processing a smaller part of the DataFrame at a time should be easier on RAM.
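If the row-by-row apply is too slow, a different technique worth trying is pd.merge_asof, which does the "last row before a given time, per ID" lookup in one vectorized call. This is a sketch under the assumption that "previous" means strictly more than two minutes older, as in check_previous; the helper name add_previous_availability and the temporary cutoff column are made up for illustration.

```python
import pandas as pd

def add_previous_availability(frame, delay=pd.Timedelta(minutes=2)):
    """Return a copy of `frame`, sorted by CreatedDate, with the new column."""
    # merge_asof needs both sides sorted by the key it matches on.
    ordered = frame.sort_values("CreatedDate", kind="stable").reset_index(drop=True)
    # Left side: for every row, the timestamp a "previous" row must be older than.
    left = ordered[["ID"]].assign(cutoff=ordered["CreatedDate"] - delay)
    # Right side: the candidate rows, keyed by their own timestamp.
    right = ordered[["CreatedDate", "ID", "Target"]].rename(
        columns={"CreatedDate": "cutoff", "Target": "Previous Availability"}
    )
    # Backward search within each ID; exact matches are excluded to keep the "<".
    matched = pd.merge_asof(left, right, on="cutoff", by="ID",
                            direction="backward", allow_exact_matches=False)
    ordered["Previous Availability"] = matched["Previous Availability"].to_numpy()
    return ordered

# The sample data from the question.
sample = pd.DataFrame({
    "CreatedDate": pd.to_datetime([
        "2018-07-03 19:10:19", "2018-07-03 19:10:20", "2018-07-03 19:12:33",
        "2018-07-03 19:12:34", "2018-07-03 19:15:24", "2018-07-03 21:23:19",
        "2018-07-03 21:23:20", "2018-07-03 21:56:33", "2018-07-03 22:01:34",
        "2018-07-03 22:15:24", "2018-07-03 22:16:24", "2018-07-03 22:17:23",
        "2018-07-03 22:17:24", "2018-07-03 22:19:24",
    ]),
    "ID": ["id1"] * 5 + ["id2"] * 9,
    "Target": [
        "Available", "Available", "Available", "Not Available", "Available",
        "Available", "Not Available", "Available", "Not Available", "Available",
        "Available", "Available", "Available", "Available",
    ],
})

result = add_previous_availability(sample)
print(result)
```

On the sample data this reproduces the Previous Availability column shown in the question. Unlike the apply version, it returns a new, sorted frame rather than modifying df in place.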