I have two dataframes (very long, with hundreds or thousands of rows each). One of them, called df1, contains a time series with a 10-minute interval. For example:
date                  value
2016-11-24 00:00:00   1759.199951
2016-11-24 00:10:00   992.400024
2016-11-24 00:20:00   1404.800049
2016-11-24 00:30:00   45.799999
2016-11-24 00:40:00   24.299999
2016-11-24 00:50:00   159.899994
2016-11-24 01:00:00   82.499999
2016-11-24 01:10:00   37.400003
2016-11-24 01:20:00   159.899994
....
The other one, df2, contains datetime intervals:
   start_date           end_date
0  2016-11-23 23:55:32  2016-11-24 00:14:03
1  2016-11-24 01:03:18  2016-11-24 01:07:12
2  2016-11-24 01:11:32  2016-11-24 02:00:00
...
I need to select all the rows of df1 that "fall" into an interval of df2.
With these examples, the resulting dataframe should be:
date                  value
2016-11-24 00:00:00   1759.199951  # Fits in row 0 of df2
2016-11-24 00:10:00   992.400024   # Fits in row 0 of df2
2016-11-24 01:00:00   82.499999    # Fits in row 1 of df2
2016-11-24 01:10:00   37.400003    # Fits in row 2 of df2
2016-11-24 01:20:00   159.899994   # Fits in row 2 of df2
....
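For reference, the expected selection can be sketched compactly with pandas' own pd.IntervalIndex (an illustrative aside, separate from the answers below; the frames are abbreviated reconstructions of the samples above):

```python
import pandas as pd

# Reconstruction of the sample frames from the question.
df1 = pd.DataFrame({
    'date': pd.to_datetime(['2016-11-24 00:00:00', '2016-11-24 00:10:00',
                            '2016-11-24 00:20:00', '2016-11-24 01:00:00',
                            '2016-11-24 01:10:00', '2016-11-24 01:20:00']),
    'value': [1759.199951, 992.400024, 1404.800049, 82.499999, 37.400003, 159.899994],
})
df2 = pd.DataFrame({
    'start_date': pd.to_datetime(['2016-11-23 23:55:32', '2016-11-24 01:03:18',
                                  '2016-11-24 01:11:32']),
    'end_date':   pd.to_datetime(['2016-11-24 00:14:03', '2016-11-24 01:07:12',
                                  '2016-11-24 02:00:00']),
})

# A row "falls" into df2 when its 10-minute span overlaps any interval.
intervals = pd.IntervalIndex.from_arrays(df2['start_date'], df2['end_date'], closed='both')
spans = pd.IntervalIndex.from_arrays(df1['date'], df1['date'] + pd.Timedelta(minutes=10),
                                     closed='left')
mask = [intervals.overlaps(span).any() for span in spans]
result = df1[mask]
```

Note the per-row loop makes this O(N*M); the answers below are aimed at doing better than that.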
Answer 0 (score: 7)
np.searchsorted: Here's a variant based on np.searchsorted that seems to be an order of magnitude faster than using intervaltree or merge, assuming my larger sample data is representative.
# Ensure the df2 is sorted (skip if it's already known to be).
df2 = df2.sort_values(by=['start_date', 'end_date'])
# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
# Perform the searchsorted and get the corresponding df2 values for both endpoints of df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
# Build the conditions that indicate an overlap (any True condition indicates an overlap).
cond = [
df1['date'].values <= s1['end_date'].values,
df1['date_end'].values <= s2['end_date'].values,
s1.index.values != s2.index.values
]
# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
df1 = df1[np.any(cond, axis=0)].drop('date_end', axis=1)
This may need modification if the intervals in df2 are nested or overlap; I haven't fully thought it through in that case, but it may still work.
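As a hedge for that case, a brute-force NumPy broadcasting check handles nesting and overlap at the cost of an N x M boolean matrix (an illustrative sketch with hypothetical nested intervals, not part of the original answer):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': pd.to_datetime(['2016-11-24 00:00:00', '2016-11-24 00:30:00'])})
# Hypothetical df2 where the second interval is nested inside the first.
df2 = pd.DataFrame({
    'start_date': pd.to_datetime(['2016-11-23 23:50:00', '2016-11-23 23:55:00']),
    'end_date':   pd.to_datetime(['2016-11-24 00:20:00', '2016-11-24 00:05:00']),
})

# Broadcast every row of df1 against every interval of df2:
# [date, date+10min) overlaps [start, end] iff start < date_end and date <= end.
d = df1['date'].values[:, None]
d_end = d + np.timedelta64(10, 'm')
overlap = (df2['start_date'].values < d_end) & (d <= df2['end_date'].values)
result = df1[overlap.any(axis=1)]
```

This trades the searchsorted speed for correctness under arbitrary interval layouts; it is only viable while N*M fits in memory.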
Not quite a pure Pandas solution, but you may want to consider building an Interval Tree from df2 and querying it against the intervals of df1 to find the overlapping ones. The intervaltree package on PyPI seems to have good performance and easy-to-use syntax.
from intervaltree import IntervalTree
# Build the Interval Tree from df2.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
# Build the 10 minutes spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
# Query the Interval Tree to filter df1.
df1 = df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
For performance reasons, I converted the dates to their integer equivalents. I doubt the intervaltree package was built with pd.Timestamp in mind, so there are probably some intermediate conversion steps that would slow things down.

Also note that intervals in the intervaltree package do not include the end point, although they do include the start point. That's why I have the + [0, 1] when creating tree; I'm padding the end point by a nanosecond to make sure the actual end point is included. It's also why I add pd.offsets.Minute(10) when querying the tree to get the interval end, instead of adding only 9m 59s.
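To illustrate those two details without pulling in intervaltree itself, the sketch below uses pd.Interval to mimic the start-inclusive/end-exclusive convention, plus the int64 nanosecond conversion (an illustrative aside, not from the original answer):

```python
import pandas as pd

# datetime64[ns] values convert to integer nanoseconds since the epoch:
ts = pd.Series(pd.to_datetime(['2016-11-24 00:00:00']))
ns = ts.astype('int64').iloc[0]

# Half-open semantics (start included, end excluded), as in intervaltree,
# mimicked here with pd.Interval; padding the end by 1 ns re-includes it.
end = ns + 600 * 10**9                       # ten minutes later, in nanoseconds
half_open = pd.Interval(ns, end, closed='left')
padded    = pd.Interval(ns, end + 1, closed='left')
print(end in half_open)  # False: the endpoint is excluded
print(end in padded)     # True: the 1 ns pad brings it back in
```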
The resulting output of either method:
date value
0 2016-11-24 00:00:00 1759.199951
1 2016-11-24 00:10:00 992.400024
6 2016-11-24 01:00:00 82.499999
7 2016-11-24 01:10:00 37.400003
8 2016-11-24 01:20:00 159.899994
The larger sample data was generated with the following setup:
# Sample df1.
n1 = 55000
df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})
# Sample df2.
n2 = 500
df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})
# Randomly shift the start and end dates of the df2 intervals.
shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
df2['start_date'] += shift_start
df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2
which yields the following for df1 and df2:
df1
date value
0 2016-11-24 00:00:00 0.444939
1 2016-11-24 00:10:00 0.407554
2 2016-11-24 00:20:00 0.460148
3 2016-11-24 00:30:00 0.465239
4 2016-11-24 00:40:00 0.462691
...
54995 2017-12-10 21:50:00 0.754123
54996 2017-12-10 22:00:00 0.401820
54997 2017-12-10 22:10:00 0.146284
54998 2017-12-10 22:20:00 0.394759
54999 2017-12-10 22:30:00 0.907233
df2
start_date end_date
0 2016-11-24 00:00:19 2016-11-24 00:41:24
1 2016-11-24 18:22:44 2016-11-24 18:36:44
2 2016-11-25 12:44:44 2016-11-25 13:03:13
3 2016-11-26 07:07:05 2016-11-26 07:49:29
4 2016-11-27 01:29:31 2016-11-27 01:34:32
...
495 2017-12-07 21:36:04 2017-12-07 22:14:29
496 2017-12-08 15:58:14 2017-12-08 16:10:35
497 2017-12-09 10:20:21 2017-12-09 10:26:40
498 2017-12-10 04:42:41 2017-12-10 05:22:47
499 2017-12-10 23:04:42 2017-12-10 23:44:53
The timings were done with the following functions:
def root_searchsorted(df1, df2):
# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
# Get the insertion indexes for the endpoints of the intervals from df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
# Build the conditions that indicate an overlap (any True condition indicates an overlap).
cond = [
df1['date'].values <= s1['end_date'].values,
df1['date_end'].values <= s2['end_date'].values,
s1.index.values != s2.index.values
]
# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
return df1[np.any(cond, axis=0)].drop('date_end', axis=1)
def root_intervaltree(df1, df2):
# Build the Interval Tree.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
# Build the 10 minutes spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
# Query the Interval Tree to filter the DataFrame.
return df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
def ptrj(df1, df2):
# The smallest amount of time - handy when using open intervals:
epsilon = pd.Timedelta(1, 'ns')
# Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
# (filling NaN's with -1)
l = edate.asof(df1.date).fillna(-1)
r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
# (taking `values` here to skip indexes, which are different)
mask = l.values < r.values
return df1[mask]
def parfait(df1, df2):
df1['key'] = 1
df2['key'] = 1
df2['row'] = df2.index.values
# CROSS JOIN
df3 = pd.merge(df1, df2, on=['key'])
# DF FILTERING
return df3[df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True) | df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True)].set_index('date')[['value', 'row']]
def root_searchsorted_modified(df1, df2):
# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
# Get the insertion indexes for the endpoints of the intervals from df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
# ---- further is the MODIFIED code ----
# Filter df1 to only overlapping intervals.
df1.query('(date <= @s1.end_date.values) |\
(date_end <= @s1.end_date.values) |\
(@s1.index.values != @s2.index.values)', inplace=True)
# Drop the extra 'date_end' column.
return df1.drop('date_end', axis=1)
I get the following timings:
%timeit root_searchsorted(df1.copy(), df2.copy())
100 loops best of 3: 9.55 ms per loop
%timeit root_searchsorted_modified(df1.copy(), df2.copy())
100 loops best of 3: 13.5 ms per loop
%timeit ptrj(df1.copy(), df2.copy())
100 loops best of 3: 18.5 ms per loop
%timeit root_intervaltree(df1.copy(), df2.copy())
1 loop best of 3: 4.02 s per loop
%timeit parfait(df1.copy(), df2.copy())
1 loop best of 3: 8.96 s per loop
Answer 1 (score: 3)
This solution (I believe it works) uses pandas.Series.asof. Under the hood, it's some version of searchsorted, and its speed is comparable to @root's functions.

I assume that all date columns are in pandas datetime format, sorted, and that the df2 intervals do not overlap.

The code is rather short but somewhat intricate (explanation below).
# The smallest amount of time - handy when using open intervals:
epsilon = pd.Timedelta(1, 'ns')
# Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
# The main function (see explanation below):
def get_it(df1):
# (filling NaN's with -1)
l = edate.asof(df1.date).fillna(-1)
r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
# (taking `values` here to skip indexes, which are different)
mask = l.values < r.values
return df1[mask]
This approach has two advantages: sdate and edate are evaluated only once, and the main function can take chunks of df1 if df1 is very large.

Explanation
pandas.Series.asof returns the last valid row for a given index. It can take an array as input and is very fast.
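A minimal illustration of that behavior (hypothetical data, not from the answer):

```python
import pandas as pd

# Lookup series: datetime index, row numbers as values.
s = pd.Series([0, 1, 2], index=pd.to_datetime(['2016-11-24 00:00:00',
                                               '2016-11-24 01:00:00',
                                               '2016-11-24 02:00:00']))
# asof returns the value at the last index position <= the argument:
hit = s.asof(pd.Timestamp('2016-11-24 01:30:00'))   # -> 1
# It also accepts an array of lookups at once:
hits = s.asof(pd.to_datetime(['2016-11-24 00:30:00', '2016-11-24 02:30:00']))
```

A lookup earlier than the first index returns NaN, which is why the answer's code uses fillna(-1).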
For the sake of explanation, let s[j] = sdate.index[j] be the j-th date in sdate and let x be some arbitrary date (timestamp).

There is always s[sdate.asof(x)] <= x (this is exactly how asof works), and it's not hard to show that:

1. j <= sdate.asof(x) if and only if s[j] <= x
2. sdate.asof(x) < j if and only if x < s[j]

Similarly for edate. Unfortunately, we can't have the same kind of inequality (either weak or strict) in both 1 and 2.
Two intervals [a, b) and [x, y] intersect iff x < b and a <= y. (We may think of a, b as coming from sdate.index and edate.index; the interval [a, b) is taken closed-open because of properties 1 and 2.) In our case, x is a date from df1, y = x + 10min - epsilon, a = s[j], b = e[j] (note that epsilon has been added to edate), where j is some number.
So, finally, the condition equivalent to "[a, b) and [x, y] intersect" is "sdate.asof(x) < j and j <= edate.asof(y) for some number j", and it roughly boils down to l < r inside the function get_it (modulo some technicalities).
Answer 2 (score: 2)
This is not trivial, but you can do the following:

First grab the relevant date columns from both dataframes and concatenate them together so that one column holds all the dates and the other two columns represent the indexes from df2. (Note that df2 gets a multi-index after stacking.)
dfm = pd.concat((df1['date'],df2.stack().reset_index())).sort_values(0)
print(dfm)
0 level_0 level_1
0 2016-11-23 23:55:32 0.0 start_date
0 2016-11-24 00:00:00 NaN NaN
1 2016-11-24 00:10:00 NaN NaN
1 2016-11-24 00:14:03 0.0 end_date
2 2016-11-24 00:20:00 NaN NaN
3 2016-11-24 00:30:00 NaN NaN
4 2016-11-24 00:40:00 NaN NaN
5 2016-11-24 00:50:00 NaN NaN
6 2016-11-24 01:00:00 NaN NaN
2 2016-11-24 01:03:18 1.0 start_date
3 2016-11-24 01:07:12 1.0 end_date
7 2016-11-24 01:10:00 NaN NaN
4 2016-11-24 01:11:32 2.0 start_date
8 2016-11-24 01:20:00 NaN NaN
5 2016-11-24 02:00:00 2.0 end_date
As you can see, the values from df1 have NaN in the right two columns, and since we have sorted the dates, these rows fall in between the start_date and end_date rows (from df2).

In order to indicate that the rows from df1 fall between the rows from df2, we can interpolate the level_0 column, which gives us:
dfm['level_0'] = dfm['level_0'].interpolate()
0 level_0 level_1
0 2016-11-23 23:55:32 0.000000 start_date
0 2016-11-24 00:00:00 0.000000 NaN
1 2016-11-24 00:10:00 0.000000 NaN
1 2016-11-24 00:14:03 0.000000 end_date
2 2016-11-24 00:20:00 0.166667 NaN
3 2016-11-24 00:30:00 0.333333 NaN
4 2016-11-24 00:40:00 0.500000 NaN
5 2016-11-24 00:50:00 0.666667 NaN
6 2016-11-24 01:00:00 0.833333 NaN
2 2016-11-24 01:03:18 1.000000 start_date
3 2016-11-24 01:07:12 1.000000 end_date
7 2016-11-24 01:10:00 1.500000 NaN
4 2016-11-24 01:11:32 2.000000 start_date
8 2016-11-24 01:20:00 2.000000 NaN
5 2016-11-24 02:00:00 2.000000 end_date
Notice that the level_0 column now contains integers (mathematically, not by data type) for the rows that fall between a start date and an end date (this assumes that an end date will not overlap the following start date).
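A minimal illustration of that integer check, with hypothetical interpolated values:

```python
import pandas as pd

# After interpolation, rows whose level_0 landed on a whole number fall
# inside an interval; fractional values sit between intervals.
level_0 = pd.Series([0.0, 0.166667, 0.5, 1.0, 1.5, 2.0])
inside = level_0 == level_0.astype(int)
print(inside.tolist())  # [True, False, False, True, False, True]
```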
Now we can just filter out the rows originally in df1:
df_falls = dfm[(dfm['level_0'] == dfm['level_0'].astype(int)) & (dfm['level_1'].isnull())][[0,'level_0']]
df_falls.columns = ['date', 'falls_index']
and merge with the original dataframe
df_final = pd.merge(df1, right=df_falls, on='date', how='outer')
which gives:
print(df_final)
date value falls_index
0 2016-11-24 00:00:00 1759.199951 0.0
1 2016-11-24 00:10:00 992.400024 0.0
2 2016-11-24 00:20:00 1404.800049 NaN
3 2016-11-24 00:30:00 45.799999 NaN
4 2016-11-24 00:40:00 24.299999 NaN
5 2016-11-24 00:50:00 159.899994 NaN
6 2016-11-24 01:00:00 82.499999 NaN
7 2016-11-24 01:10:00 37.400003 NaN
8 2016-11-24 01:20:00 159.899994 2.0
which is the same as the original dataframe with the extra column falls_index, representing the index of the df2 row that the row falls into.
Answer 3 (score: 2)
Consider a cross join merge that returns the cartesian product between both sets (all possible pairings of rows, M x N). You can cross join using an all-1s key column in merge's on argument. Then, run a filter on the large returned set using pd.Series.between(). Specifically, between() keeps rows where the start date falls within the 9:59 range of date, or where date falls between the start and end times.

However, before the merge, create a df1['date'] column equal to the date index so it can be a retained column after the merge and used for date filtering. Additionally, create a df2['row'] column to be used as a row indicator at the end. For demo, below recreates the posted df1 and df2 dataframes:
from io import StringIO
import pandas as pd
import datetime as dt
data1 = '''
date value
"2016-11-24 00:00:00" 1759.199951
"2016-11-24 00:10:00" 992.400024
"2016-11-24 00:20:00" 1404.800049
"2016-11-24 00:30:00" 45.799999
"2016-11-24 00:40:00" 24.299999
"2016-11-24 00:50:00" 159.899994
"2016-11-24 01:00:00" 82.499999
"2016-11-24 01:10:00" 37.400003
"2016-11-24 01:20:00" 159.899994
'''
df1 = pd.read_table(StringIO(data1), sep='\s+', parse_dates=[0], index_col=0)
df1['key'] = 1
df1['date'] = df1.index.values
data2 = '''
start_date end_date
"2016-11-23 23:55:32" "2016-11-24 00:14:03"
"2016-11-24 01:03:18" "2016-11-24 01:07:12"
"2016-11-24 01:11:32" "2016-11-24 02:00:00"
'''
df2 = pd.read_table(StringIO(data2), sep='\s+', parse_dates=[0,1])
df2['key'] = 1
df2['row'] = df2.index.values
# CROSS JOIN
df3 = pd.merge(df1, df2, on=['key'])
# DF FILTERING
df3 = df3[(df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True)) |
          (df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True))].set_index('date')[['value', 'row']]
print(df3)
# value row
# date
# 2016-11-24 00:00:00 1759.199951 0
# 2016-11-24 00:10:00 992.400024 0
# 2016-11-24 01:00:00 82.499999 1
# 2016-11-24 01:10:00 37.400003 2
# 2016-11-24 01:20:00 159.899994 2
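As an aside on the between() semantics used above, a minimal illustration on the question's first interval (not part of the original answer; older pandas takes inclusive=True while newer versions take inclusive='both', the default):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2016-11-24 00:00:00', '2016-11-24 00:20:00']))
start = pd.Timestamp('2016-11-23 23:55:32')
end = pd.Timestamp('2016-11-24 00:14:03')
# between() includes both endpoints by default:
mask = dates.between(start, end)
print(mask.tolist())  # [True, False]
```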
Answer 4 (score: 2)
I tried to modify @root's code using the experimental pandas query method (see the pandas documentation). For very large dataframes it should be faster than the original implementation. For small dataframes it will definitely be slower.
def root_searchsorted_modified(df1, df2):
# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
# Get the insertion indexes for the endpoints of the intervals from df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
# ---- further is the MODIFIED code ----
# Filter df1 to only overlapping intervals.
df1.query('(date <= @s1.end_date.values) |\
(date_end <= @s1.end_date.values) |\
(@s1.index.values != @s2.index.values)', inplace=True)
# Drop the extra 'date_end' column.
return df1.drop('date_end', axis=1)
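As an aside, the @ prefix in query strings pulls variables from the enclosing Python scope, which is what the modified code relies on; a minimal sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'date_end': [5, 15, 25]})
limit = 20
# @limit refers to the local Python variable, not a column of df:
filtered = df.query('date_end <= @limit')
print(filtered['date_end'].tolist())  # [5, 15]
```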