There are a few questions about finding overlaps in date or time ranges (for example). I have used those to solve my problem, but I ended up with a really slow (and far from elegant) solution. I would be very grateful if anyone knows how to make it faster (and more elegant):
The problem:
I have two dataframes, df1 and df2, each with two columns representing a start time and an end time:
>>> df1
datetime_start datetime_end
0 2016-09-11 06:00:00 2016-09-11 06:30:00
1 2016-09-11 07:00:00 2016-09-11 07:30:00
2 2016-09-11 07:30:00 2016-09-11 08:00:00
3 2016-09-11 08:00:00 2016-09-11 08:30:00
4 2016-09-11 08:30:00 2016-09-11 09:00:00
5 2016-09-11 09:00:00 2016-09-11 09:30:00
6 2016-09-11 09:30:00 2016-09-11 10:00:00
7 2016-09-11 10:30:00 2016-09-11 11:00:00
13 2016-09-11 14:00:00 2016-09-11 14:30:00
14 2016-09-11 14:30:00 2016-09-11 15:00:00
15 2016-09-11 15:00:00 2016-09-11 15:30:00
16 2016-09-11 15:30:00 2016-09-11 16:00:00
17 2016-09-11 16:00:00 2016-09-11 16:30:00
18 2016-09-11 16:30:00 2016-09-11 17:00:00
19 2016-09-11 17:00:00 2016-09-11 17:30:00
>>> df2
datetime_start datetime_end catg
4 2016-09-11 08:48:33 2016-09-11 09:41:53 a
6 2016-09-11 09:54:25 2016-09-11 10:00:50 a
8 2016-09-11 10:01:47 2016-09-11 10:04:55 b
10 2016-09-11 10:08:00 2016-09-11 10:08:11 b
12 2016-09-11 10:30:28 2016-09-11 10:30:28 b
14 2016-09-11 10:38:18 2016-09-11 10:38:18 a
18 2016-09-11 13:44:05 2016-09-11 13:44:05 a
20 2016-09-11 13:46:52 2016-09-11 14:11:41 d
23 2016-09-11 14:22:17 2016-09-11 14:33:40 b
25 2016-09-11 15:00:12 2016-09-11 15:02:55 b
27 2016-09-11 15:04:19 2016-09-11 15:06:36 b
29 2016-09-11 15:08:43 2016-09-11 15:31:29 d
31 2016-09-11 15:38:04 2016-09-11 16:09:24 a
33 2016-09-11 16:18:40 2016-09-11 16:44:32 b
35 2016-09-11 16:45:59 2016-09-11 16:59:01 b
37 2016-09-11 17:08:31 2016-09-11 17:12:23 b
39 2016-09-11 17:16:13 2016-09-11 17:16:33 c
41 2016-09-11 17:17:23 2016-09-11 17:20:00 b
45 2016-09-13 12:27:59 2016-09-13 12:34:21 a
47 2016-09-13 12:38:39 2016-09-13 12:38:45 a
What I want is to find where the ranges in df1 overlap with the ranges in df2, how long each overlap is in seconds, and what the value of df2.catg is. I want the length of the overlap inserted into a column of df1 named after the catg it represents.
Desired output:
>>> df1
datetime_start datetime_end a b d c
0 2016-09-11 06:00:00 2016-09-11 06:30:00 0.0 0.0 0.0 0.0
1 2016-09-11 07:00:00 2016-09-11 07:30:00 0.0 0.0 0.0 0.0
2 2016-09-11 07:30:00 2016-09-11 08:00:00 0.0 0.0 0.0 0.0
3 2016-09-11 08:00:00 2016-09-11 08:30:00 0.0 0.0 0.0 0.0
4 2016-09-11 08:30:00 2016-09-11 09:00:00 687.0 0.0 0.0 0.0
5 2016-09-11 09:00:00 2016-09-11 09:30:00 1800.0 0.0 0.0 0.0
6 2016-09-11 09:30:00 2016-09-11 10:00:00 1048.0 0.0 0.0 0.0
7 2016-09-11 10:30:00 2016-09-11 11:00:00 0.0 0.0 0.0 0.0
13 2016-09-11 14:00:00 2016-09-11 14:30:00 0.0 463.0 701.0 0.0
14 2016-09-11 14:30:00 2016-09-11 15:00:00 0.0 220.0 0.0 0.0
15 2016-09-11 15:00:00 2016-09-11 15:30:00 0.0 300.0 1277.0 0.0
16 2016-09-11 15:30:00 2016-09-11 16:00:00 1316.0 0.0 89.0 0.0
17 2016-09-11 16:00:00 2016-09-11 16:30:00 564.0 680.0 0.0 0.0
18 2016-09-11 16:30:00 2016-09-11 17:00:00 0.0 1654.0 0.0 0.0
19 2016-09-11 17:00:00 2016-09-11 17:30:00 0.0 389.0 0.0 20.0
My horribly slow way of doing this:
Based on this beautiful answer, I achieved what I wanted with the following hard-to-follow code:
from collections import namedtuple

Range = namedtuple('Range', ['start', 'end'])

def overlap(row1, row2):
    r1 = Range(start=row1.datetime_start, end=row1.datetime_end)
    r2 = Range(start=row2.datetime_start, end=row2.datetime_end)
    latest_start = max(r1.start, r2.start)
    earliest_end = min(r1.end, r2.end)
    delta = (earliest_end - latest_start).total_seconds()
    overlap = max(0, delta)
    return overlap

for cat in df2.catg.unique().tolist():
    df1[cat] = 0

for idx1, row1 in df1.iterrows():
    for idx2, row2 in df2.iterrows():
        if overlap(row1, row2) > 0:
            df1.loc[idx1, row2.catg] += overlap(row1, row2)
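As a standalone sanity check of the overlap formula above (a minimal sketch with one hypothetical slot/event pair, taking Range objects directly instead of dataframe rows):

```python
from collections import namedtuple
from datetime import datetime

Range = namedtuple('Range', ['start', 'end'])

def range_overlap(r1, r2):
    # overlap in seconds: min of the ends minus max of the starts,
    # clamped at zero when the ranges are disjoint
    return max(0, (min(r1.end, r2.end) - max(r1.start, r2.start)).total_seconds())

slot = Range(datetime(2016, 9, 11, 8, 30), datetime(2016, 9, 11, 9, 0))
event = Range(datetime(2016, 9, 11, 8, 48, 33), datetime(2016, 9, 11, 9, 41, 53))
print(range_overlap(slot, event))  # 687.0, matching the 08:30 row of the desired output
```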
This works, but on larger dataframes it is so slow that it is basically unusable. If anyone has any ideas on how to speed it up, I would love your input.
Thanks in advance, and let me know if anything is unclear!
Dataframe setup:
import pandas as pd
from pandas import Timestamp
d1 = {'datetime_start': {0: Timestamp('2016-09-11 06:00:00'), 1: Timestamp('2016-09-11 07:00:00'), 2: Timestamp('2016-09-11 07:30:00'), 3: Timestamp('2016-09-11 08:00:00'), 4: Timestamp('2016-09-11 08:30:00'), 5: Timestamp('2016-09-11 09:00:00'), 6: Timestamp('2016-09-11 09:30:00'), 7: Timestamp('2016-09-11 10:30:00'), 13: Timestamp('2016-09-11 14:00:00'), 14: Timestamp('2016-09-11 14:30:00'), 15: Timestamp('2016-09-11 15:00:00'), 16: Timestamp('2016-09-11 15:30:00'), 17: Timestamp('2016-09-11 16:00:00'), 18: Timestamp('2016-09-11 16:30:00'), 19: Timestamp('2016-09-11 17:00:00')}, 'datetime_end': {0: Timestamp('2016-09-11 06:30:00'), 1: Timestamp('2016-09-11 07:30:00'), 2: Timestamp('2016-09-11 08:00:00'), 3: Timestamp('2016-09-11 08:30:00'), 4: Timestamp('2016-09-11 09:00:00'), 5: Timestamp('2016-09-11 09:30:00'), 6: Timestamp('2016-09-11 10:00:00'), 7: Timestamp('2016-09-11 11:00:00'), 13: Timestamp('2016-09-11 14:30:00'), 14: Timestamp('2016-09-11 15:00:00'), 15: Timestamp('2016-09-11 15:30:00'), 16: Timestamp('2016-09-11 16:00:00'), 17: Timestamp('2016-09-11 16:30:00'), 18: Timestamp('2016-09-11 17:00:00'), 19: Timestamp('2016-09-11 17:30:00')}}
d2 = {'datetime_start': {4: Timestamp('2016-09-11 08:48:33'), 6: Timestamp('2016-09-11 09:54:25'), 8: Timestamp('2016-09-11 10:01:47'), 10: Timestamp('2016-09-11 10:08:00'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 13:46:52'), 23: Timestamp('2016-09-11 14:22:17'), 25: Timestamp('2016-09-11 15:00:12'), 27: Timestamp('2016-09-11 15:04:19'), 29: Timestamp('2016-09-11 15:08:43'), 31: Timestamp('2016-09-11 15:38:04'), 33: Timestamp('2016-09-11 16:18:40'), 35: Timestamp('2016-09-11 16:45:59'), 37: Timestamp('2016-09-11 17:08:31'), 39: Timestamp('2016-09-11 17:16:13'), 41: Timestamp('2016-09-11 17:17:23'), 45: Timestamp('2016-09-13 12:27:59'), 47: Timestamp('2016-09-13 12:38:39')}, 'datetime_end': {4: Timestamp('2016-09-11 09:41:53'), 6: Timestamp('2016-09-11 10:00:50'), 8: Timestamp('2016-09-11 10:04:55'), 10: Timestamp('2016-09-11 10:08:11'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 14:11:41'), 23: Timestamp('2016-09-11 14:33:40'), 25: Timestamp('2016-09-11 15:02:55'), 27: Timestamp('2016-09-11 15:06:36'), 29: Timestamp('2016-09-11 15:31:29'), 31: Timestamp('2016-09-11 16:09:24'), 33: Timestamp('2016-09-11 16:44:32'), 35: Timestamp('2016-09-11 16:59:01'), 37: Timestamp('2016-09-11 17:12:23'), 39: Timestamp('2016-09-11 17:16:33'), 41: Timestamp('2016-09-11 17:20:00'), 45: Timestamp('2016-09-13 12:34:21'), 47: Timestamp('2016-09-13 12:38:45')}, 'catg': {4: 'a', 6: 'a', 8: 'b', 10: 'b', 12: 'b', 14: 'a', 18: 'a', 20: 'd', 23: 'b', 25: 'b', 27: 'b', 29: 'd', 31: 'a', 33: 'b', 35: 'b', 37: 'b', 39: 'c', 41: 'b', 45: 'a', 47: 'a'}}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
Answer 0 (score: 2)
Assuming df1 and df2 are both sorted in ascending order by the datetime_start column (it appears so), then you just need to go over each row of the two dataframes once, resulting in O(n) running time instead of the current O(n^2) from the pairwise row comparisons.
The following code illustrates the idea. The key point is to use the iterators it1 and it2 to point at the current rows. Since the dataframes are sorted, if row1 is already later than row2, we can be sure that the next row in df1 is also later than row2. This is harder to explain in words than in code:
# Range and overlap are as defined in the question
def main(df1, df2):
    for cat in df2.catg.unique().tolist():
        df1[cat] = 0
    it1 = df1.iterrows()
    it2 = df2.iterrows()
    idx1, row1 = next(it1)
    idx2, row2 = next(it2)
    while True:
        try:
            r1 = Range(start=row1.datetime_start, end=row1.datetime_end)
            r2 = Range(start=row2.datetime_start, end=row2.datetime_end)
            if r2.end < r1.start:
                # no overlap. r2 before r1: advance it2
                idx2, row2 = next(it2)
            elif r1.end < r2.start:
                # no overlap. r1 before r2: advance it1
                idx1, row1 = next(it1)
            else:
                # overlap. overlap(row1, row2) must be > 0
                df1.loc[idx1, row2.catg] += overlap(row1, row2)
                # determine whether to advance it1 or it2
                if r1.end < r2.end:
                    idx1, row1 = next(it1)
                else:
                    idx2, row2 = next(it2)
        except StopIteration:
            break

main(df1, df2)
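The same two-pointer sweep can be sketched without pandas, on plain sorted lists of tuples (sweep_overlaps is a hypothetical helper for illustration, not part of the answer above):

```python
from datetime import datetime

def sweep_overlaps(slots, events):
    """slots: (start, end) tuples; events: (start, end, catg) tuples.
    Both lists sorted by start. Returns {slot_index: {catg: seconds}}."""
    totals = {i: {} for i in range(len(slots))}
    i = j = 0
    while i < len(slots) and j < len(events):
        (s1, e1), (s2, e2, catg) = slots[i], events[j]
        if e2 < s1:        # event ends before the slot starts: discard the event
            j += 1
        elif e1 < s2:      # slot ends before the event starts: discard the slot
            i += 1
        else:              # overlap: accumulate, then advance whichever ends first
            sec = (min(e1, e2) - max(s1, s2)).total_seconds()
            totals[i][catg] = totals[i].get(catg, 0) + sec
            if e1 < e2:
                i += 1
            else:
                j += 1
    return totals

slots = [(datetime(2016, 9, 11, 8, 30), datetime(2016, 9, 11, 9, 0)),
         (datetime(2016, 9, 11, 9, 0), datetime(2016, 9, 11, 9, 30))]
events = [(datetime(2016, 9, 11, 8, 48, 33), datetime(2016, 9, 11, 9, 41, 53), 'a')]
print(sweep_overlaps(slots, events))  # {0: {'a': 687.0}, 1: {'a': 1800.0}}
```

Each iteration advances one of the two pointers, so the loop runs at most len(slots) + len(events) times.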
Answer 1 (score: 2)
Based on timeit tests, with 100 executions each, the namedtuple approach in the question averaged 15.7314 seconds on my machine, versus an average of 1.4794 seconds with this approach:
import numpy as np

# determine the duration of the events in df2, in seconds
duration = (df2.datetime_end - df2.datetime_start).dt.seconds.values

# create a numpy array with one timestamp for each second
# in which an event occurred (list() is needed on Python 3,
# where map returns an iterator)
seconds_range = np.repeat(df2.datetime_start.values, duration) + \
                np.concatenate(list(map(np.arange, duration))) * pd.Timedelta('1S')

df1.merge(pd.DataFrame({'datetime_start': seconds_range,
                        'catg': np.repeat(df2.catg, duration)}). \
              groupby(['catg', pd.Grouper(key='datetime_start', freq='30min')]). \
              size(). \
              unstack(level=0). \
              reset_index(),
          how="left")
# datetime_end datetime_start a b c d
# 0 2016-09-11 06:30:00 2016-09-11 06:00:00 NaN NaN NaN NaN
# 1 2016-09-11 07:30:00 2016-09-11 07:00:00 NaN NaN NaN NaN
# 2 2016-09-11 08:00:00 2016-09-11 07:30:00 NaN NaN NaN NaN
# 3 2016-09-11 08:30:00 2016-09-11 08:00:00 NaN NaN NaN NaN
# 4 2016-09-11 09:00:00 2016-09-11 08:30:00 687.0 NaN NaN NaN
# 5 2016-09-11 09:30:00 2016-09-11 09:00:00 1800.0 NaN NaN NaN
# 6 2016-09-11 10:00:00 2016-09-11 09:30:00 1048.0 NaN NaN NaN
# 7 2016-09-11 11:00:00 2016-09-11 10:30:00 NaN NaN NaN NaN
# 8 2016-09-11 14:30:00 2016-09-11 14:00:00 NaN 463.0 NaN 701.0
# 9 2016-09-11 15:00:00 2016-09-11 14:30:00 NaN 220.0 NaN NaN
# 10 2016-09-11 15:30:00 2016-09-11 15:00:00 NaN 300.0 NaN 1277.0
# 11 2016-09-11 16:00:00 2016-09-11 15:30:00 1316.0 NaN NaN 89.0
# 12 2016-09-11 16:30:00 2016-09-11 16:00:00 564.0 680.0 NaN NaN
# 13 2016-09-11 17:00:00 2016-09-11 16:30:00 NaN 1654.0 NaN NaN
# 14 2016-09-11 17:30:00 2016-09-11 17:00:00 NaN 389.0 20.0 NaN
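To see why the resampling trick works, here it is on a tiny hypothetical two-event frame (using pd.to_timedelta in place of the map expression, which gives the same result): every second during which an event is active becomes one timestamp, and counting those timestamps per 30-minute bin yields the overlap in seconds.

```python
import numpy as np
import pandas as pd

# hypothetical events: 'a' spans the 09:00 bin boundary, 'b' sits inside one bin
ev = pd.DataFrame({
    'datetime_start': pd.to_datetime(['2016-09-11 08:59:50', '2016-09-11 09:00:05']),
    'datetime_end':   pd.to_datetime(['2016-09-11 09:00:10', '2016-09-11 09:00:08']),
    'catg': ['a', 'b'],
})
duration = (ev.datetime_end - ev.datetime_start).dt.seconds.values  # [20, 3]

# one timestamp per second during which each event was active
seconds = np.repeat(ev.datetime_start.values, duration) + \
          pd.to_timedelta(np.concatenate([np.arange(d) for d in duration]), unit='s')

per_bin = (pd.DataFrame({'datetime_start': seconds,
                         'catg': np.repeat(ev.catg.values, duration)})
           .groupby(['catg', pd.Grouper(key='datetime_start', freq='30min')])
           .size()
           .unstack(level=0)
           .fillna(0))
# event 'a' contributes 10 seconds to the 08:30 bin and 10 to the 09:00 bin;
# event 'b' contributes 3 seconds to the 09:00 bin
```

Note the .fillna(0) here: the answer's output above leaves NaN where a category never occurs in a bin, so a fillna step is needed to match the 0.0 values in the desired output.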
Answer 2 (score: 1)
With a few changes you should see a significant performance boost (roughly 8x in my testing). The structure of your code stays the same:
def overlap(row1, row2):
    # rows are positional: row[0] is datetime_end, row[1] is datetime_start
    # (this relies on the alphabetical column order shown in the results below)
    return max(0, (min(row1[0], row2[0]) - max(row1[1], row2[1])) / np.timedelta64(1, 's'))

df1 = df1.join(pd.DataFrame(dict.fromkeys(df2.catg.unique(), 0), index=df1.index))

for idx1, row1 in enumerate(df1.iloc[:, :2].values):
    for catg, row2 in zip(df2['catg'], df2.iloc[:, 1:3].values):
        df1.iat[idx1, df1.columns.get_loc(catg)] += overlap(row1, row2)
You could take this further with numba, or do some clever pandas tricks that hide all the logic.
Explanation:
- df.itertuples instead of df.iterrows
- df.iat instead of df.loc
- numpy instead of pandas time objects

Results:
datetime_end datetime_start a b c d
0 2016-09-11 06:30:00 2016-09-11 06:00:00 0 0 0 0
1 2016-09-11 07:30:00 2016-09-11 07:00:00 0 0 0 0
2 2016-09-11 08:00:00 2016-09-11 07:30:00 0 0 0 0
3 2016-09-11 08:30:00 2016-09-11 08:00:00 0 0 0 0
4 2016-09-11 09:00:00 2016-09-11 08:30:00 687 0 0 0
5 2016-09-11 09:30:00 2016-09-11 09:00:00 1800 0 0 0
6 2016-09-11 10:00:00 2016-09-11 09:30:00 1048 0 0 0
7 2016-09-11 11:00:00 2016-09-11 10:30:00 0 0 0 0
13 2016-09-11 14:30:00 2016-09-11 14:00:00 0 463 0 701
14 2016-09-11 15:00:00 2016-09-11 14:30:00 0 220 0 0
15 2016-09-11 15:30:00 2016-09-11 15:00:00 0 300 0 1277
16 2016-09-11 16:00:00 2016-09-11 15:30:00 1316 0 0 89
17 2016-09-11 16:30:00 2016-09-11 16:00:00 564 680 0 0
18 2016-09-11 17:00:00 2016-09-11 16:30:00 0 1654 0 0
19 2016-09-11 17:30:00 2016-09-11 17:00:00 0 389 20 0