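Fast filtering of timestamps that fall within multiple ranges

Time: 2018-01-17 12:19:08

Tags: python pandas

I am trying to quickly filter a pandas dataframe containing a series of timestamps, keeping only the times that fall within a set of ranges (held in a different dataframe).

Currently I use a filtering function with apply, which works but is slow. Am I missing an obvious solution?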

import pandas as pd

# Dataframe of timestamps
df = pd.DataFrame(pd.date_range('20180117', end='20180118', freq='60s'),
                  columns=['time'])

# Dataframe of intervals
df2 = pd.DataFrame([[pd.Timestamp('201801170005'), pd.Timestamp('201801170020')],
                    [pd.Timestamp('201801171415'), pd.Timestamp('201801171430')],
                    [pd.Timestamp('201801171800'), pd.Timestamp('201801171900')]],
                   columns=['start', 'end'])

def flag_during(timestamp, df):
    """Test if a timestamp falls strictly between any (start, end) pair"""
    return any((df['start'] < timestamp) & (timestamp < df['end']))

# Add a column indicating if the time falls within any interval
%timeit df['During'] = df['time'].apply(lambda l: flag_during(l, df2))

Timeit returns:

682 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
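For comparison, the same strict-inequality test can be written without apply by broadcasting the timestamps against the interval bounds in NumPy. This is only a minimal sketch, not a benchmarked solution. The column name DuringBroadcast and the ISO-formatted timestamp strings are assumptions introduced for illustration.

```python
import pandas as pd

# Same data as in the question, written with explicit ISO timestamp strings
df = pd.DataFrame(pd.date_range('2018-01-17', end='2018-01-18', freq='60s'),
                  columns=['time'])
df2 = pd.DataFrame([[pd.Timestamp('2018-01-17 00:05'), pd.Timestamp('2018-01-17 00:20')],
                    [pd.Timestamp('2018-01-17 14:15'), pd.Timestamp('2018-01-17 14:30')],
                    [pd.Timestamp('2018-01-17 18:00'), pd.Timestamp('2018-01-17 19:00')]],
                   columns=['start', 'end'])

# Column vector of timestamps, shape (N, 1)
t = df['time'].to_numpy()[:, None]
# Row vectors of interval bounds, shape (M,)
start = df2['start'].to_numpy()
end = df2['end'].to_numpy()

# Broadcasting yields an (N, M) boolean matrix; a row is True in column j
# when the timestamp lies strictly inside interval j
during_broadcast = ((start < t) & (t < end)).any(axis=1)

# Hypothetical column name, chosen for this sketch
df['DuringBroadcast'] = during_broadcast
```

With N timestamps and M intervals this builds an N x M boolean array, so it is best suited to a small number of intervals.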

Results

I implemented the solution from "How to join two dataframes for which column values are within a certain range?" (or rather its accompanying answer) and came up with the following:

%%timeit
idx = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='both')
event = ~pd.isna(df2['start'].reindex(idx.get_indexer(df['time'])))
df['During2'] = event.values

which returns:

826 µs ± 7.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That is roughly 800 times faster (682 ms vs. 826 µs).
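The key step in the snippet above is that IntervalIndex.get_indexer returns, for each timestamp, the position of the interval that contains it, or -1 when no interval does. A minimal sketch of the same idea that tests for -1 directly instead of going through reindex is shown here. The column name During3 and the ISO-formatted timestamp strings are assumptions introduced for illustration, and the sketch assumes the intervals do not overlap, since get_indexer raises an error on an overlapping IntervalIndex.

```python
import pandas as pd

# Same data as in the question, written with explicit ISO timestamp strings
df = pd.DataFrame(pd.date_range('2018-01-17', end='2018-01-18', freq='60s'),
                  columns=['time'])
df2 = pd.DataFrame([[pd.Timestamp('2018-01-17 00:05'), pd.Timestamp('2018-01-17 00:20')],
                    [pd.Timestamp('2018-01-17 14:15'), pd.Timestamp('2018-01-17 14:30')],
                    [pd.Timestamp('2018-01-17 18:00'), pd.Timestamp('2018-01-17 19:00')]],
                   columns=['start', 'end'])

# Closed on both sides, as in the snippet above: endpoints count as inside
idx = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='both')

# Position of the containing interval for each timestamp, -1 if none
positions = idx.get_indexer(df['time'])

# Hypothetical column name, chosen for this sketch
df['During3'] = positions != -1
```

Because the index is built with closed='both', timestamps equal to a start or end count as inside an interval, whereas flag_during uses strict inequalities, so the two results differ at the interval endpoints.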

0 answers:

No answers yet.