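Is there a faster way to iterate over a DataFrame?

Time: 2018-07-04 11:27:54

Tags: python pandas loops dataframe

I have a Pandas DataFrame of time slots, and I am trying to compare each slot with the other slots on the same day in order to find double bookings.

The script takes a while to run. Is there a faster way?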

df_temp = pd.DataFrame()
for date in df_cal["date"].unique():
    df_date = df_cal[df_cal["date"]==date]
    for current in range(len(df_date)):
        for comp in range(current+1,df_date[df_date["Start"]<df_date.iloc[current]["End"]]["Start"].idxmax()+1):
            df_date.loc[comp,"Double booked"] = True
            df_date.loc[current,"Double booked"] = True
            df_date.loc[comp,"Time_removed"] = max(df_date.loc[comp,"Time_removed"],pd.Timedelta(min(df_date.iloc[current]["End"] - df_date.iloc[comp]["Start"],\
                                                           df_date.iloc[comp]["End"] - df_date.iloc[comp]["Start"])))

    df_temp = pd.concat([df_temp,df_date])

The columns are [["MEET_ID", "date", "Start", "End", "Double booked", "Time_removed"]]

[[1943,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 09:00:00'),
  Timestamp('2017-05-01 09:30:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1907,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 10:00:00'),
  Timestamp('2017-05-01 11:00:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1913,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 11:00:00'),
  Timestamp('2017-05-01 12:00:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1956,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 12:00:00'),
  Timestamp('2017-05-01 12:30:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1905,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 12:30:00'),
  Timestamp('2017-05-01 13:00:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1914,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 12:30:00'),
  Timestamp('2017-05-01 13:00:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1940,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 13:00:00'),
  Timestamp('2017-05-01 16:00:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1958,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 14:30:00'),
  Timestamp('2017-05-01 15:30:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1892,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 16:00:00'),
  Timestamp('2017-05-01 16:30:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1929,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 16:30:00'),
  Timestamp('2017-05-01 17:00:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1887,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 17:30:00'),
  Timestamp('2017-05-01 18:00:00'),
  False,
  Timedelta('0 days 00:00:00')]]

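This should then produce something like the following, where double-booked meetings are flagged as such and the overlapping time is removed from one of the meetings (here it is removed from the second one). The columns are [["MEET_ID", "Start", "End", "Time_removed", "Double booked"]]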

[[1943,
  Timestamp('2017-05-01 09:00:00'),
  Timestamp('2017-05-01 09:30:00'),
  Timedelta('0 days 00:00:00'),
  False],
 [1907,
  Timestamp('2017-05-01 10:00:00'),
  Timestamp('2017-05-01 11:00:00'),
  Timedelta('0 days 00:00:00'),
  False],
 [1913,
  Timestamp('2017-05-01 11:00:00'),
  Timestamp('2017-05-01 12:00:00'),
  Timedelta('0 days 00:00:00'),
  False],
 [1956,
  Timestamp('2017-05-01 12:00:00'),
  Timestamp('2017-05-01 12:30:00'),
  Timedelta('0 days 00:00:00'),
  False],
 [1905,
  Timestamp('2017-05-01 12:30:00'),
  Timestamp('2017-05-01 13:00:00'),
  Timedelta('0 days 00:00:00'),
  False],
 [1914,
  Timestamp('2017-05-01 12:30:00'),
  Timestamp('2017-05-01 13:00:00'),
  Timedelta('0 days 00:30:00'),
  True],
 [1940,
  Timestamp('2017-05-01 13:00:00'),
  Timestamp('2017-05-01 16:00:00'),
  Timedelta('0 days 00:00:00'),
  True],
 [1958,
  Timestamp('2017-05-01 14:30:00'),
  Timestamp('2017-05-01 15:30:00'),
  Timedelta('0 days 01:00:00'),
  True],
 [1892,
  Timestamp('2017-05-01 16:00:00'),
  Timestamp('2017-05-01 16:30:00'),
  Timedelta('0 days 00:00:00'),
  False],
 [1929,
  Timestamp('2017-05-01 16:30:00'),
  Timestamp('2017-05-01 17:00:00'),
  Timedelta('0 days 00:00:00'),
  False],
 [1887,
  Timestamp('2017-05-01 17:30:00'),
  Timestamp('2017-05-01 18:00:00'),
  Timedelta('0 days 00:00:00'),
  False]]

Edit, new data 09/07/2018:

    Start               End                 Time_removed  Double booked
77  2018-07-02 00:00:00 2018-07-02 10:00:00 00:00:00      True
78  2018-07-02 03:00:00 2018-07-02 08:00:00 05:00:00      True
79  2018-07-02 03:00:00 2018-07-02 08:00:00 05:00:00      True
80  2018-07-02 04:30:00 2018-07-02 09:30:00 03:30:00      True
81  2018-07-02 05:00:00 2018-07-02 10:00:00 04:30:00      True
82  2018-07-02 05:00:00 2018-07-02 10:00:00 05:00:00      True

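Row 80 should have 5 hours removed, but only 3:30 is removed because it is compared with the row just before it. The Time_removed between rows 77 and 80 must have been calculated earlier, but it was then replaced by the smaller time difference.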
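To make the arithmetic concrete, here is a small sketch that computes the overlap of row 80 with each earlier row from the table above. The helper name `overlap` and the `meetings` dictionary are made up for this illustration and are not part of the original script.

```python
import pandas as pd

day = "2018-07-02 "
# (Start, End) for rows 77-80 of the table above.
meetings = {
    77: (pd.Timestamp(day + "00:00"), pd.Timestamp(day + "10:00")),
    78: (pd.Timestamp(day + "03:00"), pd.Timestamp(day + "08:00")),
    79: (pd.Timestamp(day + "03:00"), pd.Timestamp(day + "08:00")),
    80: (pd.Timestamp(day + "04:30"), pd.Timestamp(day + "09:30")),
}


def overlap(a, b):
    """Shared time of two (Start, End) pairs; negative if they do not overlap."""
    return min(a[1], b[1]) - max(a[0], b[0])


# Overlap of row 80 with every earlier row.
overlaps_80 = {label: overlap(meetings[label], meetings[80]) for label in (77, 78, 79)}
largest = max(overlaps_80.values())
```

Row 80 shares 5 hours with row 77 but only 3:30 with rows 78 and 79, so the value wanted for row 80 is the maximum over all earlier rows, not the result of the last comparison.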

1 Answer:

Answer 0: (score: 1)

This looks like a job for DataFrame.groupby. You can also use numpy's outer product to get rid of the inner double for loop.
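As a rough illustration of what an outer comparison gives you, the sketch below builds a pairwise overlap matrix for three of the sample meetings. The variable names and the symmetric overlap test are assumptions made for this example, not the code from the question.

```python
import numpy as np

# Start and end times of three meetings from the sample data.
starts = np.array(
    ["2017-05-01T12:30", "2017-05-01T12:30", "2017-05-01T13:00"],
    dtype="datetime64[m]",
)
ends = np.array(
    ["2017-05-01T13:00", "2017-05-01T13:00", "2017-05-01T16:00"],
    dtype="datetime64[m]",
)

# before_end[i, j] is True when meeting i starts before meeting j ends.
before_end = np.less.outer(starts, ends)
# Two meetings overlap when each one starts before the other one ends.
overlaps = before_end & before_end.T
# A meeting trivially overlaps with itself; clear the diagonal.
np.fill_diagonal(overlaps, False)
```

Only the two 12:30 - 13:00 meetings overlap here; the 13:00 - 16:00 meeting starts exactly when they end, so the strict comparison leaves it out.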

def process_data(df):
    pos = np.argwhere(np.less.outer(df['start'], df['end']))
    indices = df.index[pos]
    unique = indices.ravel().unique()
    date_diff = np.subtract.outer(df['end'], df['start']).max(axis=0)
    return pd.DataFrame(
        data=np.asarray([
            [True]*len(indices),
            np.where(
                np.isin(unique, indices[:, 1]),
                date_diff,
                np.nan
            )
        ]).T,
        columns=['Double booked', 'Time_removed'],
        index=unique
    )

df_cal.groupby('date').apply(process_data)

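In any case, this is based only on the OP's code snippet, and without a sample data frame and sample output (i.e. some kind of unit test) it is hard to tell whether it really solves the problem.

Furthermore, you have to make sure not to confuse index labels with positions. In your question you seem to mix the use of .loc, .iloc and range. I'm not sure that gives you the result you want.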
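A minimal sketch of the difference, using a made-up three-row frame whose index labels do not start at 0 (as happens after filtering a larger frame by date):

```python
import pandas as pd

# Hypothetical slice of a larger frame: the labels 4, 5, 6 are not positions.
df_date = pd.DataFrame({"MEET_ID": [1905, 1914, 1940]}, index=[4, 5, 6])

by_position = df_date.iloc[0]["MEET_ID"]  # first row, whatever its label is
by_label = df_date.loc[4, "MEET_ID"]      # row whose index label is 4

# range(len(df_date)) yields the positions 0, 1, 2; none of them is a label here.
missing_labels = [i for i in range(len(df_date)) if i not in df_date.index]

# Assigning through .loc with a label that does not exist adds a new row
# instead of updating an existing one.
enlarged = df_date.copy()
enlarged.loc[0, "MEET_ID"] = -1
```

So a loop counter taken from range is a position and belongs with .iloc, while .loc expects the labels stored in the index.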

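Edit

From the data added to the OP, it appears that the 'Date' variable actually depends on the 'Start' variable (i.e. it is just the date of the 'Start' datetime value). Given that this holds, we can skip the groupby and apply the outer product directly to get the overlaps: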

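import numpy as np

# Work on the underlying arrays; ufunc.outer is not implemented for pandas objects in recent versions.
starts = df['Start'].values
ends = df['End'].values

# overlapping[i, j] is True when meeting j starts at or after the start of meeting i but before its end.
overlapping = np.less_equal.outer(starts, starts) & np.greater.outer(ends, starts)
overlapping &= ~np.identity(len(df), dtype=bool)  # Meetings are overlapping with themselves; need to remove.
# Index the array of labels; recent pandas versions do not support 2-D indexing on an Index.
overlapping_indices = df.index.values[np.argwhere(overlapping)]

df.loc[
    np.unique(overlapping_indices.ravel()),
    'Double booked'  # column name as used in the question
] = True

first = df.loc[overlapping_indices[:, 0]]
second = df.loc[overlapping_indices[:, 1]]
df.loc[
    overlapping_indices[:, 1],
    'Time_removed'
] = (
    # .values avoids index alignment between the two selections.
    np.minimum(first['End'].values, second['End'].values)
    - np.maximum(first['Start'].values, second['Start'].values)
)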

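However, from the sample data it is not clear how you want to flag overlapping meetings as double booked. For the two 12:30:00 - 13:00:00 meetings you flagged only the second one, whereas for 13:00:00 - 16:00:00 and 14:30:00 - 15:30:00 you flagged both.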
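To show the ambiguity, the sketch below applies two possible conventions to the four meetings in question. Using MEET_ID as the index and the names `flag_both` and `flag_later_only` are choices made for this illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Start": pd.to_datetime(
            ["2017-05-01 12:30", "2017-05-01 12:30", "2017-05-01 13:00", "2017-05-01 14:30"]
        ),
        "End": pd.to_datetime(
            ["2017-05-01 13:00", "2017-05-01 13:00", "2017-05-01 16:00", "2017-05-01 15:30"]
        ),
    },
    index=[1905, 1914, 1940, 1958],  # MEET_ID used as index for readability
)
starts = df["Start"].values
ends = df["End"].values

# overlap[i, j] is True when meetings i and j share some time.
overlap = (starts[:, None] < ends[None, :]) & (starts[None, :] < ends[:, None])
# Keep each pair once, with i earlier in the (sorted) frame than j.
pair = np.triu(overlap, k=1)

flag_both = pair.any(axis=0) | pair.any(axis=1)  # both meetings of a pair
flag_later_only = pair.any(axis=0)               # only the later meeting of a pair
```

Flagging both sides marks all four meetings, while flagging only the later one marks 1914 and 1958. The expected output in the question (1914, 1940 and 1958 flagged) matches neither convention.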

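Edit 2

To account for multiple (> 3) overlapping meetings, we need to compute the overlap time for every pair of meetings and then, among the pairs that actually overlap (positive overlap), take the maximum overlap time. The following solution requires the data to be sorted by start time: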

# This requires the data frame to be sorted by `Start` time.

start_times = np.tile(df['Start'].values, (len(df), 1))
end_times = np.tile(df['End'].values, (len(df), 1))
overlap_times = np.triu(np.minimum(end_times, end_times.T) - np.maximum(start_times, start_times.T))
overlap_times[np.diag_indices(len(overlap_times))] = np.timedelta64(0)
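# Index the array of labels; recent pandas versions do not support 2-D indexing on an Index.
overlap_indices = df.index.values[np.argwhere(overlap_times > np.timedelta64(0))]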
overlaps_others_indices = np.unique(overlap_indices[:, 1])

df.loc[
    np.unique(overlap_indices.ravel()),
    'Double booked'
] = True

df.loc[
    overlaps_others_indices,
    'Time_removed'
] = pd.Series(overlap_times.max(axis=0), index=df.index)[overlaps_others_indices]
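As a check of the idea, here is a self-contained sketch of the same pairwise-maximum approach run on rows 77 to 82 from the question. It is a variant rather than the code above: it uses broadcasting instead of np.tile and overwrites both result columns outright, and the names `overlap`, `positive` and `removed` are made up for this example.

```python
import numpy as np
import pandas as pd

# Rows 77-82 from the question, already sorted by Start.
df = pd.DataFrame(
    {
        "Start": pd.to_datetime(
            ["2018-07-02 00:00", "2018-07-02 03:00", "2018-07-02 03:00",
             "2018-07-02 04:30", "2018-07-02 05:00", "2018-07-02 05:00"]
        ),
        "End": pd.to_datetime(
            ["2018-07-02 10:00", "2018-07-02 08:00", "2018-07-02 08:00",
             "2018-07-02 09:30", "2018-07-02 10:00", "2018-07-02 10:00"]
        ),
    },
    index=range(77, 83),
)
starts = df["Start"].values
ends = df["End"].values

# overlap[i, j] = min(End_i, End_j) - max(Start_i, Start_j); positive means a real overlap.
overlap = np.minimum(ends[:, None], ends[None, :]) - np.maximum(starts[:, None], starts[None, :])
# Only compare each meeting j with the meetings i that come before it (i < j).
earlier = np.triu(np.ones(overlap.shape, dtype=bool), k=1)
positive = earlier & (overlap > np.timedelta64(0, "ns"))

# Largest positive overlap with any earlier meeting; zero if there is none.
removed = np.where(positive, overlap, np.timedelta64(0, "ns")).max(axis=0)

df["Double booked"] = positive.any(axis=0) | positive.any(axis=1)
df["Time_removed"] = removed
```

With this data every row is flagged, row 77 keeps a Time_removed of zero, and rows 78 to 82 each get 5 hours, including row 80, which is the value the question asks for.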