I have a pandas time series indexed at 15-minute intervals. At each interval I have several columns: a, b and c.
| index | a | b | c |
| 9:00 am | 2 | 2 | 4 |
| 9:15 am | 2 | 2 | 4 |
...
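I need to compare a at the current time step with its average at the same time step 1, 2, 3 and 4 weeks back. So if the current time is 9:15 am, I need the average of a at 9:15 am from 1 week, 2 weeks, 3 weeks and 4 weeks earlier.
Obviously this cannot be calculated for the first 4 weeks of the dataset, because there is not enough history. I have been stuck on how to shift the DataFrame into the past to aggregate these values and then compare them with the later ones.
This has some similarity to this question, but there the index is not a time series and the comparison is simpler.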
Answer 0 (score: 0)
Here I do it with days instead of weeks. I start with making dummy data based on your example:
import pandas as pd
import random

# Dummy data: 30,000 rows at 15-minute intervals starting 2017-01-01 12:00.
d = [
    {"ts": pd.Timestamp(year=2017, month=1, day=1, hour=12,
                        minute=0, second=0) + pd.Timedelta(x * 15, unit="min"),
     "a": random.randint(2, 5),
     "b": random.randint(2, 5),
     "c": random.randint(2, 5)} for x in range(0, 30000)
]
dft = pd.DataFrame(d).set_index("ts")
I define a handler function that tries to get the value of a exactly 0, 1, 2, and 3 days before the row. Rows in the first 3 days do not have enough history and raise a KeyError, so there's a try-except that returns NaN for them. Note the Timedelta(unit=) kwarg. You can change that to get this effect for other units. I think this would be less error-prone than tweaking the call to range.
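import numpy as np

def handler(row):
    # Average "a" at the same time of day 0, 1, 2 and 3 days before this row.
    try:
        m = np.mean([dft.loc[row.name - pd.Timedelta(x, unit="d"), "a"] for x in range(4)])
    except KeyError:
        # At least one look-back timestamp is missing (not enough history).
        return np.nan
    return m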
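To adapt this to the weekly look-back in the question, one option might be to switch the unit to weeks and start the range at 1 so the current row is excluded. The snippet below is only a sketch of how the look-back timestamps would be built; the variable names `now` and `lookbacks` are made up for illustration.

```python
import pandas as pd

# Hypothetical "current" timestamp, chosen only for illustration.
now = pd.Timestamp("2017-02-01 09:15")

# Same time of day 1, 2, 3 and 4 weeks before `now`.
# unit="W" makes each step one week; range(1, 5) skips the current row.
lookbacks = [now - pd.Timedelta(k, unit="W") for k in range(1, 5)]
```

Inside the handler, the same change would mean replacing unit="d" with unit="W" and range(4) with range(1, 5).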
Finally, use apply.
dft.apply(handler, axis=1)
It's fairly slow, so I'll try to think of a faster way but for now I think this is it.
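One possible faster route, offered as a sketch rather than a tested replacement, is to avoid the row-wise apply and look up all lagged values at once by reindexing on shifted timestamps. It assumes the index is unique and regularly spaced; the names `df`, `lagged`, `weekly_mean` and `diff` and the stand-in data are hypothetical, and it uses the 1 to 4 week look-back from the question rather than days.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 6 weeks of 15-minute samples.
rng = np.random.default_rng(0)
idx = pd.date_range("2017-01-01", periods=6 * 7 * 96, freq="15min")
df = pd.DataFrame({"a": rng.integers(2, 6, size=len(idx))}, index=idx)

# For each k, fetch "a" exactly k weeks before every row. Timestamps that
# fall before the start of the data are missing and come back as NaN.
lagged = pd.DataFrame(
    {
        k: df["a"].reindex(df.index - pd.Timedelta(weeks=k)).to_numpy()
        for k in range(1, 5)
    },
    index=df.index,
)

# skipna=False keeps the mean NaN unless all four lagged values exist,
# so the first 4 weeks stay NaN, as in the row-wise handler.
weekly_mean = lagged.mean(axis=1, skipna=False)

# Difference between the current value and its 4-week average.
diff = df["a"] - weekly_mean
```

Because this does four vectorized lookups instead of one Python call per row, it should scale much better than apply on a long series, though that has not been benchmarked here.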