Pandas: select DF rows based on another DF

Date: 2016-12-07 15:11:11

Tags: python pandas

I have two DataFrames (quite long, with hundreds or thousands of rows each). One of them, df1, contains a time series with data at 10-minute intervals. For example:

               date          value
2016-11-24 00:00:00    1759.199951
2016-11-24 00:10:00     992.400024
2016-11-24 00:20:00    1404.800049
2016-11-24 00:30:00      45.799999
2016-11-24 00:40:00      24.299999
2016-11-24 00:50:00     159.899994
2016-11-24 01:00:00      82.499999
2016-11-24 01:10:00      37.400003
2016-11-24 01:20:00     159.899994
....

The other one, df2, contains datetime intervals:

              start_date             end_date
0    2016-11-23 23:55:32  2016-11-24 00:14:03
1    2016-11-24 01:03:18  2016-11-24 01:07:12
2    2016-11-24 01:11:32  2016-11-24 02:00:00 
...

I need to select all the rows in df1 that "fall" into any of the intervals in df2.

With these examples, the resulting DataFrame should be:

               date          value
2016-11-24 00:00:00    1759.199951   # Fits in row 0 of df2
2016-11-24 00:10:00     992.400024   # Fits in row 0 of df2
2016-11-24 01:00:00      82.499999   # Fits in row 1 of df2
2016-11-24 01:10:00      37.400003   # Fits in row 2 of df2
2016-11-24 01:20:00     159.899994   # Fits in row 2 of df2
....

5 Answers:

Answer 0 (score: 7):

Using np.searchsorted

Here's a variation based on np.searchsorted that seems to be an order of magnitude faster than using intervaltree or merge, assuming my larger sample data is correct.

import numpy as np
import pandas as pd

# Ensure df2 is sorted (skip if it's already known to be sorted).
df2 = df2.sort_values(by=['start_date', 'end_date'])

# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)

# Perform the searchsorted and get the corresponding df2 values for both endpoints of df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)

# Build the conditions that indicate an overlap (any True condition indicates an overlap).
cond = [
    df1['date'].values <= s1['end_date'].values,
    df1['date_end'].values <= s2['end_date'].values,
    s1.index.values != s2.index.values
    ]

# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
df1 = df1[np.any(cond, axis=0)].drop('date_end', axis=1)

This may need to be modified if the intervals in df2 are nested or overlapping; I haven't fully thought it through for that scenario, but it may still work.
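If you do run into nested or overlapping intervals in df2, one possible pre-processing step (a sketch, not part of the original answer) is to collapse them into disjoint intervals first, so the searchsorted logic above still applies:

df2 = df2.sort_values('start_date').reset_index(drop=True)

# A new group starts whenever the current start is past the running maximum of previous ends.
new_group = df2['start_date'] > df2['end_date'].cummax().shift()

# Collapse each group of touching/overlapping intervals into a single interval.
df2 = df2.groupby(new_group.cumsum()).agg({'start_date': 'min', 'end_date': 'max'}).reset_index(drop=True)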

Using an interval tree

Not quite a pure pandas solution, but you may want to consider building an interval tree from df2 and querying it against the intervals from df1 to find the ones that overlap.

The intervaltree package on PyPI seems to have good performance and easy-to-use syntax.

from intervaltree import IntervalTree

# Build the Interval Tree from df2.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])

# Build the 10 minutes spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)

# Query the Interval Tree to filter df1.
df1 = df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]

I converted the dates to their integer equivalents for performance reasons. I doubt the intervaltree package was built with pd.Timestamp in mind, so there are probably some intermediate conversion steps that slow things down.
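As a quick sanity check (a sketch, not from the original answer), those integers are simply nanoseconds since the epoch, which is the same value exposed by Timestamp.value:

print(df2['start_date'].astype('int64').iloc[0] == df2['start_date'].iloc[0].value)  # True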

Also note that intervals in the intervaltree package do not include the end point, although they do include the start point. That's why I have the + [0, 1] when creating tree: I'm padding the end point by a nanosecond to make sure the real end point is actually included. It's also why it's fine to add pd.offsets.Minute(10) to get the interval end when querying the tree, instead of adding only 9m 59s.
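A toy illustration of that end-point behaviour, using plain integers instead of timestamps (just a sketch; only the padding idea matters):

from intervaltree import IntervalTree

t = IntervalTree.from_tuples([(0, 10)])      # the interval [0, 10): end point excluded
print(t.overlaps(10, 20))                    # False

t = IntervalTree.from_tuples([(0, 10 + 1)])  # pad the end by 1, like the + [0, 1] above
print(t.overlaps(10, 20))                    # True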

The resulting output from either approach:

                 date        value
0 2016-11-24 00:00:00  1759.199951
1 2016-11-24 00:10:00   992.400024
6 2016-11-24 01:00:00    82.499999
7 2016-11-24 01:10:00    37.400003
8 2016-11-24 01:20:00   159.899994

Timings

Using the following setup to generate larger sample data:

# Sample df1.
n1 = 55000
df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})

# Sample df2.
n2 = 500
df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})

# Randomly shift the start and end dates of the df2 intervals.
shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
df2['start_date'] += shift_start
df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2

This generates the following for df1 and df2:

df1
                  date     value
0     2016-11-24 00:00:00  0.444939
1     2016-11-24 00:10:00  0.407554
2     2016-11-24 00:20:00  0.460148
3     2016-11-24 00:30:00  0.465239
4     2016-11-24 00:40:00  0.462691
...
54995 2017-12-10 21:50:00  0.754123
54996 2017-12-10 22:00:00  0.401820
54997 2017-12-10 22:10:00  0.146284
54998 2017-12-10 22:20:00  0.394759
54999 2017-12-10 22:30:00  0.907233

df2
              start_date            end_date
0   2016-11-24 00:00:19 2016-11-24 00:41:24
1   2016-11-24 18:22:44 2016-11-24 18:36:44
2   2016-11-25 12:44:44 2016-11-25 13:03:13
3   2016-11-26 07:07:05 2016-11-26 07:49:29
4   2016-11-27 01:29:31 2016-11-27 01:34:32
...
495 2017-12-07 21:36:04 2017-12-07 22:14:29
496 2017-12-08 15:58:14 2017-12-08 16:10:35
497 2017-12-09 10:20:21 2017-12-09 10:26:40
498 2017-12-10 04:42:41 2017-12-10 05:22:47
499 2017-12-10 23:04:42 2017-12-10 23:44:53

And timing with the following functions:

def root_searchsorted(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)

    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)

    # Build the conditions that indicate an overlap (any True condition indicates an overlap).
    cond = [
        df1['date'].values <= s1['end_date'].values,
        df1['date_end'].values <= s2['end_date'].values,
        s1.index.values != s2.index.values
        ]

    # Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
    return df1[np.any(cond, axis=0)].drop('date_end', axis=1)

def root_intervaltree(df1, df2):
    # Build the Interval Tree.
    tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])

    # Build the 10 minutes spans from df1.
    dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)

    # Query the Interval Tree to filter the DataFrame.
    return df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]

def ptrj(df1, df2):
    # The smallest amount of time - handy when using open intervals:
    epsilon = pd.Timedelta(1, 'ns')

    # Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
    sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
    edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)

    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values

    return df1[mask]

def parfait(df1, df2):
    df1['key'] = 1
    df2['key'] = 1
    df2['row'] = df2.index.values

    # CROSS JOIN
    df3 = pd.merge(df1, df2, on=['key'])

    # DF FILTERING
    return df3[df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True) | df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True)].set_index('date')[['value', 'row']]

def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)

    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)

    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s1.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)

    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)

I get the following timings:

%timeit root_searchsorted(df1.copy(), df2.copy())
100 loops best of 3: 9.55 ms per loop

%timeit root_searchsorted_modified(df1.copy(), df2.copy())
100 loops best of 3: 13.5 ms per loop

%timeit ptrj(df1.copy(), df2.copy())
100 loops best of 3: 18.5 ms per loop

%timeit root_intervaltree(df1.copy(), df2.copy())
1 loop best of 3: 4.02 s per loop

%timeit parfait(df1.copy(), df2.copy())
1 loop best of 3: 8.96 s per loop

Answer 1 (score: 3):

This solution (I believe it works) uses pandas.Series.asof. Under the hood it's some version of searchsorted, but for some reason it's roughly four times faster and comparable in speed to @root's function.

I assume that all the date columns are in the pandas datetime format, sorted, and that the df2 intervals are non-overlapping.
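A quick way to check those assumptions (a small sketch, not part of the original answer):

assert df1['date'].is_monotonic_increasing
assert df2['start_date'].is_monotonic_increasing
assert (df2['end_date'].values[:-1] < df2['start_date'].values[1:]).all()  # no overlapping intervals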

The code is rather short but somewhat convoluted (explanation below).

# The smallest amount of time - handy when using open intervals: 
epsilon = pd.Timedelta(1, 'ns')
# Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)

# The main function (see explanation below):
def get_it(df1):
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]

The advantage of this approach is twofold: sdate and edate are evaluated only once, and the main function can take chunks of df1 if df1 is very large.
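For example, the chunked processing could look like this (just a sketch; the chunk size is arbitrary):

chunk_size = 10000
pieces = [get_it(df1.iloc[i:i + chunk_size]) for i in range(0, len(df1), chunk_size)]
result = pd.concat(pieces)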

Explanation

pandas.Series.asof returns the last valid row for a given index. It can take an array as input and is quite fast.

For the sake of explanation, let s[j] = sdate.index[j] be the j-th date in sdate, and let x be some arbitrary date (timestamp). We always have s[sdate.asof(x)] <= x (this is exactly how asof works), and it's not hard to show that:

  1. j <= sdate.asof(x) if and only if s[j] <= x
  2. sdate.asof(x) < j if and only if x < s[j]
  3. The same holds for edate. Unfortunately, we can't have the same kind of inequality (weak or strict) in both 1 and 2.

    Two intervals [a, b) and [x, y] intersect iff x < b and a <= y. (We may think of a, b as coming from sdate.index and edate.index; the interval [a, b) is chosen to be closed-open because of properties 1 and 2.) In our case, x is a date from df1, y = x + 10min - epsilon, a = s[j], b = e[j] (note that epsilon has been added to edate), where j is some number.

    So, finally, the condition equivalent to "[a, b) and [x, y] intersect" is "sdate.asof(x) < j and j <= edate.asof(y) for some number j". It roughly boils down to l < r inside the function get_it (modulo some technicalities).
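A tiny numeric check of properties 1 and 2, using the start dates from the question (a sketch):

import pandas as pd

sdate = pd.Series(range(3), index=pd.to_datetime(
    ['2016-11-23 23:55:32', '2016-11-24 01:03:18', '2016-11-24 01:11:32']))

x = pd.Timestamp('2016-11-24 01:05:00')
j = int(sdate.asof(x))         # 1, the position of the last start_date <= x
print(sdate.index[j] <= x)     # True, property 1 with j = sdate.asof(x)
print(x < sdate.index[j + 1])  # True, property 2 with j = sdate.asof(x) + 1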

Answer 2 (score: 2):

This isn't trivial, but you can do the following:

First take the relevant date columns from the two DataFrames and concatenate them together, so that one column contains all the dates and the other two columns indicate which index of df2 each date came from. (Note that df2 gets a MultiIndex after stacking.)

dfm = pd.concat((df1['date'],df2.stack().reset_index())).sort_values(0)

print(dfm)

                    0  level_0     level_1
0 2016-11-23 23:55:32      0.0  start_date
0 2016-11-24 00:00:00      NaN         NaN
1 2016-11-24 00:10:00      NaN         NaN
1 2016-11-24 00:14:03      0.0    end_date
2 2016-11-24 00:20:00      NaN         NaN
3 2016-11-24 00:30:00      NaN         NaN
4 2016-11-24 00:40:00      NaN         NaN
5 2016-11-24 00:50:00      NaN         NaN
6 2016-11-24 01:00:00      NaN         NaN
2 2016-11-24 01:03:18      1.0  start_date
3 2016-11-24 01:07:12      1.0    end_date
7 2016-11-24 01:10:00      NaN         NaN
4 2016-11-24 01:11:32      2.0  start_date
8 2016-11-24 01:20:00      NaN         NaN
5 2016-11-24 02:00:00      2.0    end_date

You can see that the values from df1 have NaN in the two right-hand columns, and because we sorted the dates, these rows sit in between the start_date and end_date rows (from df2).

To indicate that the rows from df1 fall between the rows of df2, we can interpolate the level_0 column, which gives us:

dfm['level_0'] = dfm['level_0'].interpolate()

                    0   level_0     level_1
0 2016-11-23 23:55:32  0.000000  start_date
0 2016-11-24 00:00:00  0.000000         NaN
1 2016-11-24 00:10:00  0.000000         NaN
1 2016-11-24 00:14:03  0.000000    end_date
2 2016-11-24 00:20:00  0.166667         NaN
3 2016-11-24 00:30:00  0.333333         NaN
4 2016-11-24 00:40:00  0.500000         NaN
5 2016-11-24 00:50:00  0.666667         NaN
6 2016-11-24 01:00:00  0.833333         NaN
2 2016-11-24 01:03:18  1.000000  start_date
3 2016-11-24 01:07:12  1.000000    end_date
7 2016-11-24 01:10:00  1.500000         NaN
4 2016-11-24 01:11:32  2.000000  start_date
8 2016-11-24 01:20:00  2.000000         NaN
5 2016-11-24 02:00:00  2.000000    end_date

Notice that the level_0 column now contains integers (mathematically, not by dtype) for the rows that lie between a start date and an end date (this assumes that an end date does not overlap with the following start date).
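A quick way to verify that non-overlap assumption (a sketch):

print((df2['end_date'].values[:-1] < df2['start_date'].values[1:]).all())  # should be True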

Now we can filter out the rows that were originally in df1:

df_falls = dfm[(dfm['level_0'] == dfm['level_0'].astype(int)) & (dfm['level_1'].isnull())][[0,'level_0']]
df_falls.columns = ['date', 'falls_index']

and merge with the original DataFrame:

df_final = pd.merge(df1, right=df_falls, on='date', how='outer')

which gives:

print(df_final)

                 date        value  falls_index
0 2016-11-24 00:00:00  1759.199951          0.0
1 2016-11-24 00:10:00   992.400024          0.0
2 2016-11-24 00:20:00  1404.800049          NaN
3 2016-11-24 00:30:00    45.799999          NaN
4 2016-11-24 00:40:00    24.299999          NaN
5 2016-11-24 00:50:00   159.899994          NaN
6 2016-11-24 01:00:00    82.499999          NaN
7 2016-11-24 01:10:00    37.400003          NaN
8 2016-11-24 01:20:00   159.899994          2.0

This is the same as the original DataFrame, with an extra column falls_index that indicates the index of the df2 row the given row falls into.
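If you only want the rows that actually fall into one of the intervals, a short follow-up (a sketch) is to drop the rows whose falls_index is NaN:

df_matched = df_final.dropna(subset=['falls_index'])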

Answer 3 (score: 2):

Consider a cross join merge, which returns the Cartesian product between both sets (all possible row pairings, M x N). You can cross join using an all-1 key column in merge's on argument. Then run a filter on the large returned set using pd.Series.between(). Specifically, between() keeps rows where the start date falls within the 9:59 range of date, or where date falls between the start and end times.

However, before the merge, create a df1['date'] column equal to the date index so that it is retained after the merge and can be used for date filtering. Also create a df2['row'] column to serve as a row indicator at the end. For the demonstration, the posted df1 and df2 DataFrames are recreated below:

from io import StringIO
import pandas as pd
import datetime as dt

data1 = '''
date                     value
"2016-11-24 00:00:00"    1759.199951
"2016-11-24 00:10:00"     992.400024
"2016-11-24 00:20:00"    1404.800049
"2016-11-24 00:30:00"      45.799999
"2016-11-24 00:40:00"      24.299999
"2016-11-24 00:50:00"     159.899994
"2016-11-24 01:00:00"      82.499999
"2016-11-24 01:10:00"      37.400003
"2016-11-24 01:20:00"     159.899994
'''    
df1 = pd.read_table(StringIO(data1), sep='\s+', parse_dates=[0], index_col=0)
df1['key'] = 1
df1['date'] = df1.index.values

data2 = '''
start_date  end_date
"2016-11-23 23:55:32"  "2016-11-24 00:14:03"
"2016-11-24 01:03:18"  "2016-11-24 01:07:12"
"2016-11-24 01:11:32"  "2016-11-24 02:00:00"
'''    
df2 = pd.read_table(StringIO(data2), sep='\s+', parse_dates=[0,1])
df2['key'] = 1
df2['row'] = df2.index.values

# CROSS JOIN
df3 = pd.merge(df1, df2, on=['key'])

# DF FILTERING
df3 = df3[df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True) |
          df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True)].set_index('date')[['value', 'row']]

print(df3)
#                            value  row
# date                                 
# 2016-11-24 00:00:00  1759.199951    0
# 2016-11-24 00:10:00   992.400024    0
# 2016-11-24 01:00:00    82.499999    1
# 2016-11-24 01:10:00    37.400003    2
# 2016-11-24 01:20:00   159.899994    2

Answer 4 (score: 2):

I tried to modify @root's code using the experimental pandas query method (see the pandas docs). It should be faster than the original implementation for very large DataFrames. For small DataFrames it will definitely be slower.

def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)

    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)

    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s1.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)

    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)