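I'm looking at a Pandas DataFrame of time slots and trying to compare each slot with the other slots on the same day in order to find double bookings.
The script takes a while to run. Is there a faster way?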
df_temp = pd.DataFrame()
for date in df_cal["date"].unique():
    df_date = df_cal[df_cal["date"]==date]
    for current in range(len(df_date)):
        for comp in range(current+1,df_date[df_date["Start"]<df_date.iloc[current]["End"]]["Start"].idxmax()+1):
            df_date.loc[comp,"Double booked"] = True
            df_date.loc[current,"Double booked"] = True
            df_date.loc[comp,"Time_removed"] = max(df_date.loc[comp,"Time_removed"],pd.Timedelta(min(df_date.iloc[current]["End"] - df_date.iloc[comp]["Start"],\
                df_date.iloc[comp]["End"] - df_date.iloc[comp]["Start"])))
    df_temp = pd.concat([df_temp,df_date])
The columns are [["MEET_ID", "date", "Start", "End", "Double booked", "Time_removed"]]:
[[1943,
Timestamp('2017-05-01 00:00:00'),
Timestamp('2017-05-01 09:00:00'),
Timestamp('2017-05-01 09:30:00'),
False,
Timedelta('0 days 00:00:00')],
[1907,
Timestamp('2017-05-01 00:00:00'),
Timestamp('2017-05-01 10:00:00'),
Timestamp('2017-05-01 11:00:00'),
False,
Timedelta('0 days 00:00:00')],
[1913,
Timestamp('2017-05-01 00:00:00'),
Timestamp('2017-05-01 11:00:00'),
Timestamp('2017-05-01 12:00:00'),
False,
Timedelta('0 days 00:00:00')],
[1956,
Timestamp('2017-05-01 00:00:00'),
Timestamp('2017-05-01 12:00:00'),
Timestamp('2017-05-01 12:30:00'),
False,
Timedelta('0 days 00:00:00')],
[1905,
Timestamp('2017-05-01 00:00:00'),
Timestamp('2017-05-01 12:30:00'),
Timestamp('2017-05-01 13:00:00'),
False,
Timedelta('0 days 00:00:00')],
[1914,
Timestamp('2017-05-01 00:00:00'),
Timestamp('2017-05-01 12:30:00'),
Timestamp('2017-05-01 13:00:00'),
False,
Timedelta('0 days 00:00:00')],
[1940,
Timestamp('2017-05-01 00:00:00'),
Timestamp('2017-05-01 13:00:00'),
Timestamp('2017-05-01 16:00:00'),
False,
Timedelta('0 days 00:00:00')],
[1958,
Timestamp('2017-05-01 00:00:00'),
Timestamp('2017-05-01 14:30:00'),
Timestamp('2017-05-01 15:30:00'),
False,
Timedelta('0 days 00:00:00')],
[1892,
Timestamp('2017-05-01 00:00:00'),
Timestamp('2017-05-01 16:00:00'),
Timestamp('2017-05-01 16:30:00'),
False,
Timedelta('0 days 00:00:00')],
[1929,
Timestamp('2017-05-01 00:00:00'),
Timestamp('2017-05-01 16:30:00'),
Timestamp('2017-05-01 17:00:00'),
False,
Timedelta('0 days 00:00:00')],
[1887,
Timestamp('2017-05-01 00:00:00'),
Timestamp('2017-05-01 17:30:00'),
Timestamp('2017-05-01 18:00:00'),
False,
Timedelta('0 days 00:00:00')]]
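This should then produce something like the following, where double-booked meetings are flagged as such and the overlapping time is removed from one of the meetings (here it is removed from the second one). The columns are [["MEET_ID", "Start", "End", "Time_removed", "Double booked"]]: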
[[1943,
Timestamp('2017-05-01 09:00:00'),
Timestamp('2017-05-01 09:30:00'),
Timedelta('0 days 00:00:00'),
False],
[1907,
Timestamp('2017-05-01 10:00:00'),
Timestamp('2017-05-01 11:00:00'),
Timedelta('0 days 00:00:00'),
False],
[1913,
Timestamp('2017-05-01 11:00:00'),
Timestamp('2017-05-01 12:00:00'),
Timedelta('0 days 00:00:00'),
False],
[1956,
Timestamp('2017-05-01 12:00:00'),
Timestamp('2017-05-01 12:30:00'),
Timedelta('0 days 00:00:00'),
False],
[1905,
Timestamp('2017-05-01 12:30:00'),
Timestamp('2017-05-01 13:00:00'),
Timedelta('0 days 00:00:00'),
False],
[1914,
Timestamp('2017-05-01 12:30:00'),
Timestamp('2017-05-01 13:00:00'),
Timedelta('0 days 00:30:00'),
True],
[1940,
Timestamp('2017-05-01 13:00:00'),
Timestamp('2017-05-01 16:00:00'),
Timedelta('0 days 00:00:00'),
True],
[1958,
Timestamp('2017-05-01 14:30:00'),
Timestamp('2017-05-01 15:30:00'),
Timedelta('0 days 01:00:00'),
True],
[1892,
Timestamp('2017-05-01 16:00:00'),
Timestamp('2017-05-01 16:30:00'),
Timedelta('0 days 00:00:00'),
False],
[1929,
Timestamp('2017-05-01 16:30:00'),
Timestamp('2017-05-01 17:00:00'),
Timedelta('0 days 00:00:00'),
False],
[1887,
Timestamp('2017-05-01 17:30:00'),
Timestamp('2017-05-01 18:00:00'),
Timedelta('0 days 00:00:00'),
False]]
Edit, new data 09/07/2018:
    Start                End                  Time_removed  Double booked
77  2018-07-02 00:00:00  2018-07-02 10:00:00  00:00:00      True
78  2018-07-02 03:00:00  2018-07-02 08:00:00  05:00:00      True
79  2018-07-02 03:00:00  2018-07-02 08:00:00  05:00:00      True
80  2018-07-02 04:30:00  2018-07-02 09:30:00  03:30:00      True
81  2018-07-02 05:00:00  2018-07-02 10:00:00  04:30:00      True
82  2018-07-02 05:00:00  2018-07-02 10:00:00  05:00:00      True
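Row 80 should have 5 hours removed, but only 3:30 is removed because it is compared with the row just before it. Time_removed must have been calculated between rows 77 and 80 earlier, but it was then replaced by the smaller time difference.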
Answer 0 (score: 1)
This looks like a job for DataFrame.groupby. You can also use numpy's outer product to eliminate the inner double for loop.
def process_data(df):
    pos = np.argwhere(np.less.outer(df['start'], df['end']))
    indices = df.index[pos]
    unique = indices.ravel().unique()
    date_diff = np.subtract.outer(df['end'], df['start']).max(axis=0)
    return pd.DataFrame(
        data=np.asarray([
            [True]*len(indices),
            np.where(
                np.isin(unique, indices[:, 1]),
                date_diff,
                np.nan
            )
        ]).T,
        columns=['Double booked', 'Time_removed'],
        index=unique
    )

df_cal.groupby('date').apply(process_data)
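In any case, this is based only on the OP's code snippet, and without a sample DataFrame and sample output (i.e. some kind of unit test) it is hard to say whether it really solves the problem.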
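To make the outer-comparison idea concrete, here is a minimal sketch on three hypothetical meetings (the times are made up for illustration and are not taken from the question). It passes plain NumPy arrays to the ufunc's outer method and builds a boolean matrix of pairwise overlaps, assuming that two meetings overlap when each one starts before the other ends.

```python
import numpy as np
import pandas as pd

# Hypothetical meetings: A 09:00-10:00, B 09:30-10:30, C 11:00-12:00.
start = pd.to_datetime(
    ["2017-05-01 09:00", "2017-05-01 09:30", "2017-05-01 11:00"]
).values
end = pd.to_datetime(
    ["2017-05-01 10:00", "2017-05-01 10:30", "2017-05-01 12:00"]
).values

# mask[i, j] is True when meeting i starts before meeting j ends.
mask = np.less.outer(start, end)

# Two meetings overlap when each of them starts before the other one ends.
overlaps = mask & mask.T
np.fill_diagonal(overlaps, False)  # a meeting trivially overlaps itself
```

Only A and B overlap here, so the only True entries of overlaps are at positions (0, 1) and (1, 0).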
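In addition, you have to make sure not to mix up index labels and positions. In your question you seem to mix .loc and .iloc together with range. I'm not sure this gives you the result you want.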
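The difference between labels and positions can be seen in a small sketch. The frame below is hypothetical; it only borrows the index labels 77 to 79 from the question's data, and the integer Start values are placeholders.

```python
import pandas as pd

# Hypothetical frame whose index labels (77, 78, 79) differ from positions (0, 1, 2).
df = pd.DataFrame({"Start": [0, 3, 4]}, index=[77, 78, 79])

by_position = df.iloc[0]["Start"]  # first row, regardless of its label
by_label = df.loc[77, "Start"]     # row whose index label is 77

# Reading df.loc[0, "Start"] would raise a KeyError here, because no row is
# labelled 0. After resetting the index, labels and positions coincide again,
# so values produced by range() can be used with either accessor.
df_reset = df.reset_index(drop=True)
same = df_reset.loc[0, "Start"] == df_reset.iloc[0]["Start"]
```

Calling reset_index(drop=True) on each per-date slice is one way to make a range-based loop safe, at the cost of losing the original labels.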
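Judging from the data added to the OP, the 'Date' variable actually depends on the 'Start' variable (i.e. it is just the date of the 'Start' datetime value). Given that this holds, we can skip the groupby and apply the outer product directly to get the overlapping items: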
# ufunc.outer is not supported on pandas Series, so work on the underlying arrays.
overlapping = np.less_equal.outer(df['Start'].values, df['Start'].values) & np.greater.outer(df['End'].values, df['Start'].values)
overlapping &= ~np.identity(len(df), dtype=bool)  # Meetings are overlapping with themselves; need to remove.
# Index the label array rather than the Index object so that 2-D indexing works.
overlapping_indices = df.index.values[np.argwhere(overlapping)]
df.loc[
    np.unique(overlapping_indices.ravel()),
    'double_booked'
] = True
df.loc[
    overlapping_indices[:, 1],
    'Time_removed'
] = (
    np.minimum(df.loc[overlapping_indices[:, 0], 'End'].values, df.loc[overlapping_indices[:, 1], 'End'].values)
    - np.maximum(df.loc[overlapping_indices[:, 0], 'Start'].values, df.loc[overlapping_indices[:, 1], 'Start'].values)
)
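However, from the sample data it is not clear how you want to handle flagging overlapping meetings as double booked. For the 12:30:00 - 13:00:00 meetings you flagged only the second one, while for 13:00:00 - 16:00:00 and 14:30:00 - 15:30:00 you flagged both.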
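To account for multiple (> 3) overlapping meetings, we need to compute the overlap time for all pairs of meetings and then take, for each meeting, the maximum overlap among the pairs that actually (positively) overlap. The following solution requires the data to be sorted by start time: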
# This requires the data frame to be sorted by `Start` time.
start_times = np.tile(df['Start'].values, (len(df), 1))
end_times = np.tile(df['End'].values, (len(df), 1))
overlap_times = np.triu(np.minimum(end_times, end_times.T) - np.maximum(start_times, start_times.T))
overlap_times[np.diag_indices(len(overlap_times))] = np.timedelta64(0)
# Index the label array rather than the Index object so that 2-D indexing works.
overlap_indices = df.index.values[np.argwhere(overlap_times > np.timedelta64(0))]
overlaps_others_indices = np.unique(overlap_indices[:, 1])
df.loc[
    np.unique(overlap_indices.ravel()),
    'double_booked'
] = True
df.loc[
    overlaps_others_indices,
    'Time_removed'
] = pd.Series(overlap_times.max(axis=0), index=df.index)[overlaps_others_indices]
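As a check of the maximum-overlap idea, the following sketch applies it to rows 77 to 82 from the edit in the question. It is a variant rather than the exact code above: it uses boolean masks instead of index labels to sidestep the label/position issue, and it assumes the frame is already sorted by Start and that the flag column is named "Double booked" as in the question.

```python
import numpy as np
import pandas as pd

# Rows 77-82 from the question's edit, already sorted by Start.
df = pd.DataFrame(
    {
        "Start": pd.to_datetime([
            "2018-07-02 00:00", "2018-07-02 03:00", "2018-07-02 03:00",
            "2018-07-02 04:30", "2018-07-02 05:00", "2018-07-02 05:00",
        ]),
        "End": pd.to_datetime([
            "2018-07-02 10:00", "2018-07-02 08:00", "2018-07-02 08:00",
            "2018-07-02 09:30", "2018-07-02 10:00", "2018-07-02 10:00",
        ]),
    },
    index=[77, 78, 79, 80, 81, 82],
)
df["Double booked"] = False
df["Time_removed"] = pd.Timedelta(0)

start = df["Start"].values
end = df["End"].values
zero = np.timedelta64(0, "ns")

# overlap[i, j] = min(end_i, end_j) - max(start_i, start_j); positive means overlap.
overlap = np.minimum.outer(end, end) - np.maximum.outer(start, start)
# Keep only pairs where row i comes before row j (strict upper triangle).
overlap[np.tril_indices(len(df))] = zero

has_overlap = overlap > zero
later = has_overlap.any(axis=0)             # overlaps some earlier meeting
flagged = later | has_overlap.any(axis=1)   # involved in any overlap at all
removed = overlap.max(axis=0)               # largest overlap with an earlier meeting

df.loc[flagged, "Double booked"] = True
df.loc[later, "Time_removed"] = removed[later]
```

On this data every row from 78 to 82 ends up with 05:00:00 removed, including row 80, while row 77 keeps 00:00:00. Taking the maximum over all earlier meetings is what prevents a later, smaller overlap from replacing the larger one.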