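I have a large DataFrame from which I need sliding time-window averages at a given set of query points. I tried df.rolling,
but that does not let me query arbitrary points. The following works, but it seems inefficient and cannot be used in a vectorized way: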
import pandas as pd
df = pd.DataFrame({'B': range(5)},
index = [pd.Timestamp('20130101 09:00:00'),
pd.Timestamp('20130101 09:00:02'),
pd.Timestamp('20130101 09:00:03'),
pd.Timestamp('20130101 09:00:05'),
pd.Timestamp('20130101 09:00:06')])
query = pd.date_range(df.index[0], df.index[-1], freq='s')
time_window = pd.Timedelta(seconds=2)
f = lambda t: df[(t - time_window < df.index) & (df.index <= t)]["B"].mean()
[f(t) for t in query] # works but is slow
f(query) # throws ValueError length must match
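Maybe this can be done better...
Edit: In the real application, measurements arrive at random intervals of 30 to 90 seconds. Sometimes there are periods of days or weeks with no data. time_window
is typically 15 minutes. The overall time span is 10 years.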
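Answer 0 (score: 0)
You just skipped one small step.
Your "query" is really a time-series resampling operation. That is, besides computing the rolling mean, you are also trying to resample the time series onto a regular one-second grid. You can do this with the asfreq
method, applied before the rolling operation: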
import numpy as np
resample_rolling = df.asfreq('1s').rolling(pd.Timedelta(seconds=2)).mean()
print(np.array([f(t) for t in query]))
print(resample_rolling.to_numpy()[:, 0])
Output:
[0. 0. 1. 1.5 2. 3. 3.5]
[0. 0. 1. 1.5 2. 3. 3.5]
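The resampling step above works because the query points lie on a regular one-second grid. For query points that do not fall on such a grid, a similar effect can be sketched by reindexing onto the union of the data timestamps and the query timestamps instead of calling asfreq. This is only a minimal sketch under the assumption that the data index is sorted and unique; the names combined and rolling_at_query are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"B": range(5)},
    index=pd.to_datetime([
        "2013-01-01 09:00:00",
        "2013-01-01 09:00:02",
        "2013-01-01 09:00:03",
        "2013-01-01 09:00:05",
        "2013-01-01 09:00:06",
    ]),
)
query = pd.date_range(df.index[0], df.index[-1], freq="s")
time_window = pd.Timedelta(seconds=2)

# Add a NaN row for every query time that is not already a data time.
combined = df.reindex(df.index.union(query))

# The time-based rolling mean skips NaN, so only real measurements
# contribute; afterwards keep just the rows at the query times.
rolling_at_query = combined["B"].rolling(time_window).mean().loc[query]
print(rolling_at_query.to_numpy())
```

For the toy data this should agree with the loop over f(t), since the union index happens to coincide with the one-second grid.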
Note that by default, the asfreq method fills missing values with NaN.
>>> df.asfreq(pd.Timedelta(seconds=1))
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 NaN
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 NaN
2013-01-01 09:00:05 3.0
2013-01-01 09:00:06 4.0
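The rolling operation then ignores these values. If you would rather fill the gaps with something other than NaN, you have two options. You can provide a fill_value: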
>>> df.asfreq('1s', fill_value=0.0)
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 0.0
2013-01-01 09:00:05 3.0
2013-01-01 09:00:06 4.0
Or you can specify a method, such as backfill, which uses the next value in the series:
>>> df.asfreq('1s', method='backfill')
B
2013-01-01 09:00:00 0
2013-01-01 09:00:01 1
2013-01-01 09:00:02 1
2013-01-01 09:00:03 2
2013-01-01 09:00:04 3
2013-01-01 09:00:05 3
2013-01-01 09:00:06 4
Of course, the resulting rolling mean is different:
>>> df.asfreq('1s', method='backfill').rolling('1s').mean()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 3.0
2013-01-01 09:00:05 3.0
2013-01-01 09:00:06 4.0
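To make the effect of the fill strategy easier to see, the following sketch puts the three variants side by side using the question's two-second window (rather than the one-second window in the snippet above). The dictionary filled and the frame means are illustrative names, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {"B": range(5)},
    index=pd.to_datetime([
        "2013-01-01 09:00:00",
        "2013-01-01 09:00:02",
        "2013-01-01 09:00:03",
        "2013-01-01 09:00:05",
        "2013-01-01 09:00:06",
    ]),
)
window = pd.Timedelta(seconds=2)

# Three ways of filling the gaps created by resampling to one second.
filled = {
    "nan": df.asfreq("1s"),
    "zero": df.asfreq("1s", fill_value=0.0),
    "backfill": df.asfreq("1s", method="backfill"),
}

# One column of rolling means per fill strategy.
means = pd.DataFrame(
    {name: frame["B"].rolling(window).mean() for name, frame in filled.items()}
)
print(means)
```

Only the "nan" column reproduces the loop over f(t); the other two treat the filled values as if they were real measurements.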
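Answer 1 (score: 0)
After some research, I came up with the following solution, which uses two rolling windows, one for points entering the window and one for points leaving it: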
import pandas as pd, numpy as np
df = pd.DataFrame({'B': range(5)},
index = [pd.Timestamp('20130101 09:00:00'),
pd.Timestamp('20130101 09:00:02'),
pd.Timestamp('20130101 09:00:03'),
pd.Timestamp('20130101 09:00:05'),
pd.Timestamp('20130101 09:00:06')])
query = pd.date_range(df.index[0], df.index[-1], freq='s')
time_window = pd.Timedelta(seconds=2)
aggregates = ['mean']
### Preparation
# one data point for each point entering the window
df1 = df.rolling(window=time_window, closed='right').agg(aggregates)
# one data point for each point leaving the window - use reverted df
df2 = df[::-1].rolling(window=time_window, closed='left').agg(aggregates)
df2.index += time_window
# Caution: for my real data in the reverted rolling method, I had
# to add a small Timedelta to window to function properly
# merge both together and remove duplicates
df_windowed = pd.concat([df1, df2])
df_windowed.sort_index(inplace=True)
df_windowed = df_windowed[~df_windowed.index.duplicated(keep='first')]
### the vectorized function
# Caution: get_indexer returns -1 for values that are not found (below df.index.min()),
# which iloc interprets as the last row. But the last row of df_windowed is always NaN.
# get_indexer handles a single timestamp and an array of timestamps alike
# (Index.get_loc no longer accepts method= in pandas 2.0+); the result is always an array.
def f(t):
    t = pd.DatetimeIndex(np.atleast_1d(t))
    return df_windowed.iloc[
        df_windowed.index.get_indexer(t, method='ffill')
    ]["B"]["mean"].to_numpy()
f(query)
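As a cross-check for f(query), the same window means over (t - time_window, t] can also be computed directly with prefix sums and np.searchsorted. This is a different technique from the two-rolling-window approach above, sketched under the assumptions that the index is sorted and the values contain no NaN; window_mean is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def window_mean(series, query, window):
    """Mean of `series` over (t - window, t] for each t in `query`."""
    times = series.index.to_numpy()            # assumed sorted ascending
    values = series.to_numpy(dtype=float)      # assumed free of NaN
    q = pd.DatetimeIndex(query).to_numpy()
    w = pd.Timedelta(window).to_timedelta64()

    # csum[i] is the sum of the first i values (leading zero included).
    csum = np.concatenate(([0.0], np.cumsum(values)))

    # Count of samples with time <= t, and with time <= t - window.
    hi = np.searchsorted(times, q, side="right")
    lo = np.searchsorted(times, q - w, side="right")

    counts = hi - lo
    sums = csum[hi] - csum[lo]

    # Empty windows stay NaN instead of dividing by zero.
    out = np.full(len(q), np.nan)
    np.divide(sums, counts, out=out, where=counts > 0)
    return out

df = pd.DataFrame(
    {"B": range(5)},
    index=pd.to_datetime([
        "2013-01-01 09:00:00",
        "2013-01-01 09:00:02",
        "2013-01-01 09:00:03",
        "2013-01-01 09:00:05",
        "2013-01-01 09:00:06",
    ]),
)
query = pd.date_range(df.index[0], df.index[-1], freq="s")
time_window = pd.Timedelta(seconds=2)

result = window_mean(df["B"], query, time_window)
print(result)
```

The cost is one cumulative sum over the data plus two binary searches per query point, so no dense grid is needed. For very long series the difference of two large prefix sums may lose some floating-point precision, which is worth checking on real data.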