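I recently asked how to count the number of registers within a time interval, and got a working answer in "Count number of registers in interval".
That solution works well, but I had to adapt it so that it also takes a location key into account, i.e. the count is computed separately for each location.
I did that with the following code: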
import numpy as np
import pandas as pd

def time_features(df, time_key, T, location_key, output_key):
    """
    Create features based on time such as: how many BDs are open in the same GRA at this moment (hour)?
    """
    # the time column must be datetime64 for the Timedelta arithmetic below
    assert np.issubdtype(df[time_key].dtype, np.datetime64)
    pieces = []
    grouped = df.groupby(location_key)
    for name, group in grouped:
        # initialize times registers open as 1, close as -1
        start_times = group.copy()
        start_times[time_key] = group[time_key] - pd.Timedelta(hours=T)
        start_times[output_key] = 1
        aux = group.copy()
        aux[output_key] = -1
        # stack open and close events into one frame (open events come first)
        all_times = pd.concat([start_times, aux], ignore_index=True)
        # sort by time and perform a cumulative sum to get opened registers
        # (subtract 1 since you don't want to include the current time as opened)
        all_times = all_times.sort_values(by=time_key)
        all_times[output_key] = all_times[output_key].cumsum() - 1
        # revert the index back to original order, and truncate closed times
        all_times = all_times.sort_index().iloc[:len(all_times)//2]
        pieces.append(all_times)
    # combine the per-location results into a single frame
    return pd.concat(pieces, ignore_index=True)
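To make the idea inside the loop easier to follow, here is a minimal sketch of the open/close sweep on a single group. The three timestamps and the names `events` and `delta` are made up for illustration; only the +1/-1 cumulative-sum idea comes from the function above.

```python
import pandas as pd

# Hypothetical single-group example: three timestamps, window T = 2 hours.
times = pd.Series(pd.to_datetime([
    "2013-01-01 09:00:00",
    "2013-01-01 10:30:00",
    "2013-01-01 13:00:00",
]))
T = pd.Timedelta(hours=2)

# Each row contributes an "open" event at time - T (+1)
# and a "close" event at time (-1).
events = pd.concat([
    pd.DataFrame({"time": times - T, "delta": 1}),
    pd.DataFrame({"time": times, "delta": -1}),
], ignore_index=True)

# Sweep through the events in time order; the running sum is the
# number of windows currently open.
events = events.sort_values("time", kind="stable")
events["count"] = events["delta"].cumsum() - 1  # exclude the row itself

# Labels 0..len(times)-1 are the "open" events, one per original row.
counts = events.sort_index().iloc[:len(times)]["count"].tolist()
print(counts)  # [0, 1, 0]
```

Only the 10:30 row has another register (09:00) within the preceding two hours, so it is the only one with a non-zero count.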
Sample data (df):
                 time  loc1 loc2
0 2013-01-01 12:56:00     1  "a"
1 2013-01-01 12:00:12     1  "b"
2 2013-01-01 10:34:28     2  "c"
3 2013-01-01 09:34:54     2  "c"
4 2013-01-01 08:34:55     3  "d"
5 2013-01-01 08:34:55     5  "d"
6 2013-01-01 16:35:19     4  "e"
7 2013-01-01 16:35:30     4  "e"
time_features(df, time_key='time', T=2, location_key='loc1', output_key='count')
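This works for small data, but on larger data (I am using a file with 1 million rows) it takes "forever" to run. I wonder whether this computation can be optimized somehow.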
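One direction I am considering, as a rough sketch only: skip the per-group DataFrame copies and do the counting on NumPy arrays, using a sort plus two binary searches per location. The names `count_in_window` and `time_features_sketch` are hypothetical, and the sketch assumes T is a whole number of hours. Unlike the function above, it keeps the original row order and leaves the time column unshifted, and rows with identical timestamps in the same location count each other. I have not benchmarked it on the full file.

```python
import numpy as np
import pandas as pd

def count_in_window(values, window):
    """For each timestamp t, count the other timestamps in (t - window, t]."""
    ordered = np.sort(values)
    # number of timestamps <= t
    upper = np.searchsorted(ordered, values, side="right")
    # number of timestamps <= t - window
    lower = np.searchsorted(ordered, values - window, side="right")
    return upper - lower - 1  # subtract 1 to exclude the row itself

def time_features_sketch(df, time_key, T, location_key, output_key):
    # Hypothetical alternative; assumes T is an integer number of hours.
    times = df[time_key].to_numpy()
    window = np.timedelta64(T, "h")
    counts = np.empty(len(df), dtype=np.int64)
    # groupby(...).indices maps each location to the row positions it owns
    for positions in df.groupby(location_key).indices.values():
        counts[positions] = count_in_window(times[positions], window)
    return df.assign(**{output_key: counts})

# The sample data from above.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2013-01-01 12:56:00", "2013-01-01 12:00:12",
        "2013-01-01 10:34:28", "2013-01-01 09:34:54",
        "2013-01-01 08:34:55", "2013-01-01 08:34:55",
        "2013-01-01 16:35:19", "2013-01-01 16:35:30",
    ]),
    "loc1": [1, 1, 2, 2, 3, 5, 4, 4],
    "loc2": ["a", "b", "c", "c", "d", "d", "e", "e"],
})

result = time_features_sketch(df, time_key="time", T=2,
                              location_key="loc1", output_key="count")
print(result["count"].tolist())  # [1, 0, 1, 0, 0, 0, 0, 1]
```

This still loops over locations in Python, so with a very large number of distinct locations it may not help as much as I hope.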