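I'm trying to quickly filter a pandas DataFrame containing a series of timestamps, keeping only the times that fall within a set of ranges (held in a different DataFrame).
Currently I use a filter function together with apply, which works but is slow. Am I missing an obvious solution?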
import pandas as pd

# Dataframe of timestamps
df = pd.DataFrame(pd.date_range('20180117', end='20180118', freq='60s'),
                  columns=['time'])

# Dataframe of intervals
df2 = pd.DataFrame([[pd.Timestamp('201801170005'), pd.Timestamp('201801170020')],
                    [pd.Timestamp('201801171415'), pd.Timestamp('201801171430')],
                    [pd.Timestamp('201801171800'), pd.Timestamp('201801171900')]],
                   columns=['start', 'end'])

def flag_during(timestamp, df):
    """Test if a timestamp falls strictly between any (start, end) pair"""
    return any((df['start'] < timestamp) & (timestamp < df['end']))

# Add a column indicating if the time falls within any interval
%timeit df['During'] = df['time'].apply(lambda l: flag_during(l, df2))
timeit returns:
682 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
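For comparison, here is a minimal sketch of the same strict-inequality test done without a per-row apply, using NumPy broadcasting. This swaps in a different technique from the one above, and it assumes the number of intervals is small enough that a boolean matrix of shape (number of timestamps, number of intervals) fits comfortably in memory. The column name `DuringBroadcast` is only illustrative, and the interval bounds are the same ones as above, written in ISO form.

```python
import numpy as np
import pandas as pd

# Same data as above
df = pd.DataFrame(pd.date_range('2018-01-17', end='2018-01-18', freq='60s'),
                  columns=['time'])
df2 = pd.DataFrame([[pd.Timestamp('2018-01-17 00:05'), pd.Timestamp('2018-01-17 00:20')],
                    [pd.Timestamp('2018-01-17 14:15'), pd.Timestamp('2018-01-17 14:30')],
                    [pd.Timestamp('2018-01-17 18:00'), pd.Timestamp('2018-01-17 19:00')]],
                   columns=['start', 'end'])

# Column vector of timestamps, shape (n_times, 1)
t = df['time'].to_numpy()[:, None]
# Row vectors of interval bounds, shape (n_intervals,)
start = df2['start'].to_numpy()
end = df2['end'].to_numpy()

# Broadcast to an (n_times, n_intervals) boolean matrix, then reduce per row.
# Strict inequalities mirror flag_during, so interval endpoints are excluded.
df['DuringBroadcast'] = ((start < t) & (t < end)).any(axis=1)
```

Because every timestamp is compared against every interval in one shot, memory use grows with the product of the two lengths, so this is a reasonable trade only when the interval table is short.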
I adapted the solution from "How to join two dataframes for which column values are within a certain range?" (or rather its accompanying answer) and came up with the following:
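%%timeit
# Build an interval index from the [start, end] pairs (endpoints included)
idx = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='both')
# get_indexer gives the position of the matching interval, or -1 if none;
# reindexing on -1 (not a label in df2) yields NaN, which isna then detects
event = ~pd.isna(df2['start'].reindex(idx.get_indexer(df['time'])))
df['During2'] = event.values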
which returns:
826 µs ± 7.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That is roughly 800 times faster, close to three orders of magnitude.
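A slightly shorter variant of the same idea, as a sketch: since `IntervalIndex.get_indexer` returns -1 for values that match no interval, the result can be compared against -1 directly instead of going through `reindex` and `isna`. This assumes the intervals do not overlap, which `get_indexer` needs on an `IntervalIndex`. The column names `During3` and `During3Strict` are only illustrative. Note also that `closed='both'` includes the interval endpoints, whereas `flag_during` uses strict inequalities, so the two approaches disagree exactly at the boundaries; `closed='neither'` reproduces the strict behaviour.

```python
import pandas as pd

# Same data as above
df = pd.DataFrame(pd.date_range('2018-01-17', end='2018-01-18', freq='60s'),
                  columns=['time'])
df2 = pd.DataFrame([[pd.Timestamp('2018-01-17 00:05'), pd.Timestamp('2018-01-17 00:20')],
                    [pd.Timestamp('2018-01-17 14:15'), pd.Timestamp('2018-01-17 14:30')],
                    [pd.Timestamp('2018-01-17 18:00'), pd.Timestamp('2018-01-17 19:00')]],
                   columns=['start', 'end'])

# Endpoints included, as in the During2 version
idx = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='both')
df['During3'] = idx.get_indexer(df['time']) != -1

# Endpoints excluded, matching the strict comparison in flag_during
idx_open = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='neither')
df['During3Strict'] = idx_open.get_indexer(df['time']) != -1

# Rows where the two conventions differ are exactly the interval endpoints
boundary_rows = df[df['During3'] != df['During3Strict']]
```

If the intervals could overlap, `get_indexer` would raise, and a different lookup (for example `get_indexer_non_unique`) would be needed.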