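Efficiently find the overlap of date-time ranges from 2 dataframes

Time: 2018-04-25 21:43:57

Tags: python pandas datetime date-range

There are a few questions about finding overlap in date or time ranges (for example). I've used these to solve my problem, but I've ended up with a very slow (and not at all elegant) solution. I would really appreciate it if someone knows how to make this faster (and more elegant):

Problem: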

I have 2 dataframes, df1 and df2, each with 2 columns representing a start time and an end time:

>>> df1
        datetime_start        datetime_end
0  2016-09-11 06:00:00 2016-09-11 06:30:00
1  2016-09-11 07:00:00 2016-09-11 07:30:00
2  2016-09-11 07:30:00 2016-09-11 08:00:00
3  2016-09-11 08:00:00 2016-09-11 08:30:00
4  2016-09-11 08:30:00 2016-09-11 09:00:00
5  2016-09-11 09:00:00 2016-09-11 09:30:00
6  2016-09-11 09:30:00 2016-09-11 10:00:00
7  2016-09-11 10:30:00 2016-09-11 11:00:00
13 2016-09-11 14:00:00 2016-09-11 14:30:00
14 2016-09-11 14:30:00 2016-09-11 15:00:00
15 2016-09-11 15:00:00 2016-09-11 15:30:00
16 2016-09-11 15:30:00 2016-09-11 16:00:00
17 2016-09-11 16:00:00 2016-09-11 16:30:00
18 2016-09-11 16:30:00 2016-09-11 17:00:00
19 2016-09-11 17:00:00 2016-09-11 17:30:00

>>> df2
        datetime_start        datetime_end catg
4  2016-09-11 08:48:33 2016-09-11 09:41:53    a
6  2016-09-11 09:54:25 2016-09-11 10:00:50    a
8  2016-09-11 10:01:47 2016-09-11 10:04:55    b
10 2016-09-11 10:08:00 2016-09-11 10:08:11    b
12 2016-09-11 10:30:28 2016-09-11 10:30:28    b
14 2016-09-11 10:38:18 2016-09-11 10:38:18    a
18 2016-09-11 13:44:05 2016-09-11 13:44:05    a
20 2016-09-11 13:46:52 2016-09-11 14:11:41    d
23 2016-09-11 14:22:17 2016-09-11 14:33:40    b
25 2016-09-11 15:00:12 2016-09-11 15:02:55    b
27 2016-09-11 15:04:19 2016-09-11 15:06:36    b
29 2016-09-11 15:08:43 2016-09-11 15:31:29    d
31 2016-09-11 15:38:04 2016-09-11 16:09:24    a
33 2016-09-11 16:18:40 2016-09-11 16:44:32    b
35 2016-09-11 16:45:59 2016-09-11 16:59:01    b
37 2016-09-11 17:08:31 2016-09-11 17:12:23    b
39 2016-09-11 17:16:13 2016-09-11 17:16:33    c
41 2016-09-11 17:17:23 2016-09-11 17:20:00    b
45 2016-09-13 12:27:59 2016-09-13 12:34:21    a
47 2016-09-13 12:38:39 2016-09-13 12:38:45    a

What I want is to find where the ranges in df1 overlap with the ranges in df2, how long that overlap is (in seconds), and which value of df2.catg it corresponds to. I want the length of that overlap inserted into a column of df1 (the column is named after the catg it represents).

Desired output

>>> df1
        datetime_start        datetime_end       a       b       d     c
0  2016-09-11 06:00:00 2016-09-11 06:30:00     0.0     0.0     0.0   0.0
1  2016-09-11 07:00:00 2016-09-11 07:30:00     0.0     0.0     0.0   0.0
2  2016-09-11 07:30:00 2016-09-11 08:00:00     0.0     0.0     0.0   0.0
3  2016-09-11 08:00:00 2016-09-11 08:30:00     0.0     0.0     0.0   0.0
4  2016-09-11 08:30:00 2016-09-11 09:00:00   687.0     0.0     0.0   0.0
5  2016-09-11 09:00:00 2016-09-11 09:30:00  1800.0     0.0     0.0   0.0
6  2016-09-11 09:30:00 2016-09-11 10:00:00  1048.0     0.0     0.0   0.0
7  2016-09-11 10:30:00 2016-09-11 11:00:00     0.0     0.0     0.0   0.0
13 2016-09-11 14:00:00 2016-09-11 14:30:00     0.0   463.0   701.0   0.0
14 2016-09-11 14:30:00 2016-09-11 15:00:00     0.0   220.0     0.0   0.0
15 2016-09-11 15:00:00 2016-09-11 15:30:00     0.0   300.0  1277.0   0.0
16 2016-09-11 15:30:00 2016-09-11 16:00:00  1316.0     0.0    89.0   0.0
17 2016-09-11 16:00:00 2016-09-11 16:30:00   564.0   680.0     0.0   0.0
18 2016-09-11 16:30:00 2016-09-11 17:00:00     0.0  1654.0     0.0   0.0
19 2016-09-11 17:00:00 2016-09-11 17:30:00     0.0   389.0     0.0  20.0
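To make the desired output concrete, here is a small worked example of where the 687.0 in row 4, column a comes from. The helper name overlap_seconds is made up for this illustration; it uses the standard library's datetime, and the same arithmetic should apply unchanged to pandas Timestamp values.

```python
from datetime import datetime

def overlap_seconds(start1, end1, start2, end2):
    """Length in seconds of the intersection of two time ranges (0 if disjoint)."""
    latest_start = max(start1, start2)
    earliest_end = min(end1, end2)
    return max(0.0, (earliest_end - latest_start).total_seconds())

# df1 row 4 (08:30:00 to 09:00:00) against df2 row 4 (08:48:33 to 09:41:53, catg 'a')
row4_a = overlap_seconds(
    datetime(2016, 9, 11, 8, 30, 0), datetime(2016, 9, 11, 9, 0, 0),
    datetime(2016, 9, 11, 8, 48, 33), datetime(2016, 9, 11, 9, 41, 53),
)
print(row4_a)  # 687.0, i.e. 09:00:00 minus 08:48:33 = 11 min 27 s
```

The other cells are sums of such pairwise overlaps over every df2 row with the same catg.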

A very slow way of doing this:

Based on this beautiful answer, I've achieved what I want using the following hard-to-follow code:

from collections import namedtuple
Range = namedtuple('Range', ['start', 'end'])

def overlap(row1, row2):
    r1 = Range(start=row1.datetime_start, end=row1.datetime_end)
    r2 = Range(start=row2.datetime_start, end=row2.datetime_end)
    latest_start = max(r1.start, r2.start)
    earliest_end = min(r1.end, r2.end)
    delta = (earliest_end - latest_start).total_seconds()
    overlap = max(0, delta)
    return overlap

for cat in df2.catg.unique().tolist():
    df1[cat] = 0

for idx1, row1 in df1.iterrows():
    for idx2, row2 in df2.iterrows():
        if overlap(row1, row2) > 0:
            df1.loc[idx1, row2.catg] += overlap(row1, row2)

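This works, but it is so slow on larger dataframes that it is basically unusable. If anyone has any ideas for speeding this up, I would love your input.

Thanks in advance, and let me know if anything is unclear!

Dataframe setup: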

import pandas as pd
from pandas import Timestamp

d1 = {'datetime_start': {0: Timestamp('2016-09-11 06:00:00'), 1: Timestamp('2016-09-11 07:00:00'), 2: Timestamp('2016-09-11 07:30:00'), 3: Timestamp('2016-09-11 08:00:00'), 4: Timestamp('2016-09-11 08:30:00'), 5: Timestamp('2016-09-11 09:00:00'), 6: Timestamp('2016-09-11 09:30:00'), 7: Timestamp('2016-09-11 10:30:00'), 13: Timestamp('2016-09-11 14:00:00'), 14: Timestamp('2016-09-11 14:30:00'), 15: Timestamp('2016-09-11 15:00:00'), 16: Timestamp('2016-09-11 15:30:00'), 17: Timestamp('2016-09-11 16:00:00'), 18: Timestamp('2016-09-11 16:30:00'), 19: Timestamp('2016-09-11 17:00:00')}, 'datetime_end': {0: Timestamp('2016-09-11 06:30:00'), 1: Timestamp('2016-09-11 07:30:00'), 2: Timestamp('2016-09-11 08:00:00'), 3: Timestamp('2016-09-11 08:30:00'), 4: Timestamp('2016-09-11 09:00:00'), 5: Timestamp('2016-09-11 09:30:00'), 6: Timestamp('2016-09-11 10:00:00'), 7: Timestamp('2016-09-11 11:00:00'), 13: Timestamp('2016-09-11 14:30:00'), 14: Timestamp('2016-09-11 15:00:00'), 15: Timestamp('2016-09-11 15:30:00'), 16: Timestamp('2016-09-11 16:00:00'), 17: Timestamp('2016-09-11 16:30:00'), 18: Timestamp('2016-09-11 17:00:00'), 19: Timestamp('2016-09-11 17:30:00')}}

d2 = {'datetime_start': {4: Timestamp('2016-09-11 08:48:33'), 6: Timestamp('2016-09-11 09:54:25'), 8: Timestamp('2016-09-11 10:01:47'), 10: Timestamp('2016-09-11 10:08:00'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 13:46:52'), 23: Timestamp('2016-09-11 14:22:17'), 25: Timestamp('2016-09-11 15:00:12'), 27: Timestamp('2016-09-11 15:04:19'), 29: Timestamp('2016-09-11 15:08:43'), 31: Timestamp('2016-09-11 15:38:04'), 33: Timestamp('2016-09-11 16:18:40'), 35: Timestamp('2016-09-11 16:45:59'), 37: Timestamp('2016-09-11 17:08:31'), 39: Timestamp('2016-09-11 17:16:13'), 41: Timestamp('2016-09-11 17:17:23'), 45: Timestamp('2016-09-13 12:27:59'), 47: Timestamp('2016-09-13 12:38:39')}, 'datetime_end': {4: Timestamp('2016-09-11 09:41:53'), 6: Timestamp('2016-09-11 10:00:50'), 8: Timestamp('2016-09-11 10:04:55'), 10: Timestamp('2016-09-11 10:08:11'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 14:11:41'), 23: Timestamp('2016-09-11 14:33:40'), 25: Timestamp('2016-09-11 15:02:55'), 27: Timestamp('2016-09-11 15:06:36'), 29: Timestamp('2016-09-11 15:31:29'), 31: Timestamp('2016-09-11 16:09:24'), 33: Timestamp('2016-09-11 16:44:32'), 35: Timestamp('2016-09-11 16:59:01'), 37: Timestamp('2016-09-11 17:12:23'), 39: Timestamp('2016-09-11 17:16:33'), 41: Timestamp('2016-09-11 17:20:00'), 45: Timestamp('2016-09-13 12:34:21'), 47: Timestamp('2016-09-13 12:38:45')}, 'catg': {4: 'a', 6: 'a', 8: 'b', 10: 'b', 12: 'b', 14: 'a', 18: 'a', 20: 'd', 23: 'b', 25: 'b', 27: 'b', 29: 'd', 31: 'a', 33: 'b', 35: 'b', 37: 'b', 39: 'c', 41: 'b', 45: 'a', 47: 'a'}}

df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

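3 Answers:

Answer 0: (score: 2)

Assuming both df1 and df2 are sorted in ascending order by the datetime_start column (which appears to be the case), you only need to pass through the rows of the two dataframes once. That gives O(n) running time instead of the current O(n^2), which comes from comparing every pair of rows.

The following code illustrates the idea. The key point is to use the iterators it1 and it2 to point to the current rows. Since the dataframes are sorted, if row1 is already later in time than row2, we can be sure that the next row in df1 is also later than row2. This is harder to explain in words than in code: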

# Reuses Range and overlap() as defined in the question.
def main(df1, df2):
    for cat in df2.catg.unique().tolist():
        df1[cat] = 0
    it1 = df1.iterrows()
    it2 = df2.iterrows()
    idx1, row1 = next(it1)
    idx2, row2 = next(it2)
    while True:
        try:
            r1 = Range(start=row1.datetime_start, end=row1.datetime_end)
            r2 = Range(start=row2.datetime_start, end=row2.datetime_end)
            if r2.end < r1.start:
                # no overlap. r2 before r1. advance it2
                idx2, row2 = next(it2)
            elif r1.end < r2.start:
                # no overlap. r1 before r2. advance it1
                idx1, row1 = next(it1)
            else:
                # ranges intersect (or touch): overlap(row1, row2) >= 0
                df1.loc[idx1, row2.catg] += overlap(row1, row2)
                # determine whether to advance it1 or it2
                if r1.end < r2.end:
                    # advance it1
                    idx1, row1 = next(it1)
                else:
                    # advance it2
                    idx2, row2 = next(it2)
        except StopIteration:
            break

main(df1, df2)
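The sweep above is only valid if both frames really are sorted by datetime_start. A minimal sketch of how that assumption could be checked before calling main, using the pandas Series attribute is_monotonic_increasing; the helper name is_sorted_by_start and the three-row frame are made up for illustration:

```python
import pandas as pd

def is_sorted_by_start(df, col='datetime_start'):
    """Return True if `col` is in non-decreasing order."""
    return bool(df[col].is_monotonic_increasing)

# Tiny stand-in for df1 / df2 (first three start times from df1)
example = pd.DataFrame({
    'datetime_start': pd.to_datetime([
        '2016-09-11 06:00:00',
        '2016-09-11 07:00:00',
        '2016-09-11 07:30:00',
    ]),
})

shuffled = example.iloc[::-1]                      # reversed, so not sorted
resorted = shuffled.sort_values('datetime_start')  # sorted copy

print(is_sorted_by_start(example))   # True
print(is_sorted_by_start(shuffled))  # False
print(is_sorted_by_start(resorted))  # True
```

If either check fails, sorting with sort_values('datetime_start') first keeps the single-pass logic applicable.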

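Answer 1: (score: 2)

According to timeit tests with 100 executions each, the namedtuple approach in the question averaged 15.7314 seconds on my machine, versus an average of 1.4794 seconds for this approach: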

import numpy as np
import pandas as pd

# determine the duration of the events in df2, in seconds
duration = (df2.datetime_end - df2.datetime_start).dt.seconds.values

# create a numpy array with one timestamp for each second 
# in which an event occurred
# (np.concatenate needs a sequence; map() is a lazy iterator in Python 3)
seconds_range = np.repeat(df2.datetime_start.values, duration) + \
                np.concatenate([np.arange(d) for d in duration]) * pd.Timedelta('1s')

df1.merge(pd.DataFrame({'datetime_start':seconds_range,
                        'catg':np.repeat(df2.catg, duration)}). \
              groupby(['catg', pd.Grouper(key='datetime_start', freq='30min')]). \
              size(). \
              unstack(level=0). \
              reset_index(), 
          how="left")

#           datetime_end      datetime_start       a       b     c       d
# 0  2016-09-11 06:30:00 2016-09-11 06:00:00     NaN     NaN   NaN     NaN
# 1  2016-09-11 07:30:00 2016-09-11 07:00:00     NaN     NaN   NaN     NaN
# 2  2016-09-11 08:00:00 2016-09-11 07:30:00     NaN     NaN   NaN     NaN
# 3  2016-09-11 08:30:00 2016-09-11 08:00:00     NaN     NaN   NaN     NaN
# 4  2016-09-11 09:00:00 2016-09-11 08:30:00   687.0     NaN   NaN     NaN
# 5  2016-09-11 09:30:00 2016-09-11 09:00:00  1800.0     NaN   NaN     NaN
# 6  2016-09-11 10:00:00 2016-09-11 09:30:00  1048.0     NaN   NaN     NaN
# 7  2016-09-11 11:00:00 2016-09-11 10:30:00     NaN     NaN   NaN     NaN
# 8  2016-09-11 14:30:00 2016-09-11 14:00:00     NaN   463.0   NaN   701.0
# 9  2016-09-11 15:00:00 2016-09-11 14:30:00     NaN   220.0   NaN     NaN
# 10 2016-09-11 15:30:00 2016-09-11 15:00:00     NaN   300.0   NaN  1277.0
# 11 2016-09-11 16:00:00 2016-09-11 15:30:00  1316.0     NaN   NaN    89.0
# 12 2016-09-11 16:30:00 2016-09-11 16:00:00   564.0   680.0   NaN     NaN
# 13 2016-09-11 17:00:00 2016-09-11 16:30:00     NaN  1654.0   NaN     NaN
# 14 2016-09-11 17:30:00 2016-09-11 17:00:00     NaN   389.0  20.0     NaN
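The output above has NaN where the desired output in the question has 0.0. One way to close that gap, sketched on a two-row, two-category excerpt of the values shown above, is to fill the category columns with zeros after the merge. The names merged and catg_cols are placeholders for this illustration, not part of the answer:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the result of df1.merge(...): two rows, two categories
merged = pd.DataFrame({
    'datetime_start': pd.to_datetime(['2016-09-11 08:30:00', '2016-09-11 14:00:00']),
    'a': [687.0, np.nan],
    'b': [np.nan, 463.0],
})

# Replace missing overlaps with 0 in the category columns only
catg_cols = ['a', 'b']
merged[catg_cols] = merged[catg_cols].fillna(0)

print(merged)
```

Restricting fillna to the category columns leaves the datetime columns untouched.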

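Answer 2: (score: 1)

With a few changes you should see a significant (about 8x in my test) performance improvement. The structure of your code stays the same: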

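import numpy as np

def overlap(row1, row2):
    # each row is (datetime_end, datetime_start): earliest end minus latest start
    return max(0, (min(row1[0], row2[0]) - max(row1[1], row2[1])) / np.timedelta64(1, 's'))

df1 = df1.join(pd.DataFrame(dict.fromkeys(df2.catg.unique(), 0), index=df1.index))

# select columns by name so the (end, start) order does not depend on how the frames were built
cols = ['datetime_end', 'datetime_start']
for idx1, row1 in enumerate(df1[cols].values):
    for catg, row2 in zip(df2['catg'], df2[cols].values):
        df1.iat[idx1, df1.columns.get_loc(catg)] += overlap(row1, row2)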

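You can take this further with numba, or do something clever in pandas that hides all the logic.

Explanation

  1. Use df.itertuples instead of df.iterrows
  2. Use df.iat instead of df.loc
  3. Use numpy instead of pandas time objects
  4. Remove the named tuple creation
  5. Remove the duplicate overlap calculation
  6. Improve the overlap algorithm

Result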

              datetime_end      datetime_start     a     b   c     d
    0  2016-09-11 06:30:00 2016-09-11 06:00:00     0     0   0     0
    1  2016-09-11 07:30:00 2016-09-11 07:00:00     0     0   0     0
    2  2016-09-11 08:00:00 2016-09-11 07:30:00     0     0   0     0
    3  2016-09-11 08:30:00 2016-09-11 08:00:00     0     0   0     0
    4  2016-09-11 09:00:00 2016-09-11 08:30:00   687     0   0     0
    5  2016-09-11 09:30:00 2016-09-11 09:00:00  1800     0   0     0
    6  2016-09-11 10:00:00 2016-09-11 09:30:00  1048     0   0     0
    7  2016-09-11 11:00:00 2016-09-11 10:30:00     0     0   0     0
    13 2016-09-11 14:30:00 2016-09-11 14:00:00     0   463   0   701
    14 2016-09-11 15:00:00 2016-09-11 14:30:00     0   220   0     0
    15 2016-09-11 15:30:00 2016-09-11 15:00:00     0   300   0  1277
    16 2016-09-11 16:00:00 2016-09-11 15:30:00  1316     0   0    89
    17 2016-09-11 16:30:00 2016-09-11 16:00:00   564   680   0     0
    18 2016-09-11 17:00:00 2016-09-11 16:30:00     0  1654   0     0
    19 2016-09-11 17:30:00 2016-09-11 17:00:00     0   389  20     0
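As a follow-up to the remark about going further, here is one possible fully vectorized variant using NumPy broadcasting. It is not taken from any of the answers; it is a sketch that assumes the full len(df1) x len(df2) matrix of pairwise overlaps fits in memory, and the function name overlap_by_category is made up for illustration:

```python
import numpy as np
import pandas as pd

def overlap_by_category(df1, df2):
    """Return a copy of df1 with one column per df2.catg: total overlap in seconds."""
    s1 = df1['datetime_start'].values[:, None]   # shape (n1, 1)
    e1 = df1['datetime_end'].values[:, None]
    s2 = df2['datetime_start'].values[None, :]   # shape (1, n2)
    e2 = df2['datetime_end'].values[None, :]
    # Pairwise overlap: earliest end minus latest start, clipped at zero
    delta = np.minimum(e1, e2) - np.maximum(s1, s2)
    secs = np.clip(delta / np.timedelta64(1, 's'), 0, None)   # (n1, n2) floats
    out = df1.copy()
    catg = df2['catg'].values
    for cat in pd.unique(catg):
        # Sum the overlaps of all df2 rows belonging to this category
        out[cat] = secs[:, catg == cat].sum(axis=1)
    return out

# Two rows from df1 and two rows from df2 in the question
small1 = pd.DataFrame({
    'datetime_start': pd.to_datetime(['2016-09-11 08:30:00', '2016-09-11 09:00:00']),
    'datetime_end': pd.to_datetime(['2016-09-11 09:00:00', '2016-09-11 09:30:00']),
})
small2 = pd.DataFrame({
    'datetime_start': pd.to_datetime(['2016-09-11 08:48:33', '2016-09-11 10:01:47']),
    'datetime_end': pd.to_datetime(['2016-09-11 09:41:53', '2016-09-11 10:04:55']),
    'catg': ['a', 'b'],
})
result = overlap_by_category(small1, small2)
print(result)  # column a: 687.0 and 1800.0; column b: 0.0 and 0.0
```

Unlike the loops above, this does not require either frame to be sorted, but its memory use grows with the product of the two row counts, so very large inputs would need to be processed in chunks.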