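Compare current dataframe value to an aggregate of previous time steps' values in pandas

Time: 2019-01-18 14:18:20

Tags: python pandas

I have a pandas time series indexed at 15-minute intervals. At each interval I have multiple columns a, b and c.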

| index   | a | b | c |
| 9:00 am | 2 | 2 | 4 |
| 9:15 am | 2 | 2 | 4 |
...

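I need to compare a against the average of a at the same time of day 1, 2, 3 and 4 weeks before the current time step. So if my current time is 9:15 am, I need to find the average of a at 9:15 am from 1 week, 2 weeks, 3 weeks and 4 weeks earlier.

Obviously this can't be calculated for the first 4 weeks of the dataset because there isn't enough history. I'm stuck on how to look back into the dataframe to aggregate these values and then compare them with the current value.

this question has some similarities, but the index there isn't a time series and the comparison is simpler.

1 answer:

Answer 0: (score: 0)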

Here I do it with days instead of weeks. I start by making dummy data based on your example:

import random

import pandas as pd

# Dummy data: 30,000 rows at 15-minute intervals (as in the question),
# with random integers in columns a, b and c.
start = pd.Timestamp(year=2017, month=1, day=1, hour=12)
d = [
    {
        "ts": start + pd.Timedelta(x * 15, unit="min"),
        "a": random.randint(2, 5),
        "b": random.randint(2, 5),
        "c": random.randint(2, 5),
    }
    for x in range(30000)
]
dft = pd.DataFrame(d).set_index("ts")

I define a handler function that tries to get the value of a exactly 0, 1, 2, and 3 days before each row's timestamp. Since the lookup raises a KeyError for the first 3 days, where there isn't enough history, there's a try-except that returns NaN. Note the unit= kwarg of Timedelta. You can change that to get this effect for other units (for example unit="W" for weeks); I think this would be less error-prone than tweaking the call to range.

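import numpy as np

def handler(row):
    # Average column "a" at this row's timestamp and at the same time 1, 2 and 3 days earlier.
    try:
        m = np.mean([dft.loc[row.name - pd.Timedelta(x, unit="d"), "a"] for x in range(4)])
    except KeyError:
        # A lookback timestamp is missing, so there is not enough history for this row.
        return np.nan
    return m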

Finally, use apply.

dft.apply(handler, axis=1)
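To get closer to what the question asks (1 to 4 weeks back, excluding the current row), the same idea can be parameterized. This is only a sketch: make_handler, its arguments and the demo frame are names I made up for illustration, and it assumes the index is unique and contains the exact lookback timestamps.

```python
import numpy as np
import pandas as pd

def make_handler(frame, column, step, lags):
    """Build a row handler averaging `column` at (row time - k * step) for k in lags."""
    def handler(row):
        try:
            return np.mean([frame.loc[row.name - k * step, column] for k in lags])
        except KeyError:
            # At least one lookback timestamp is missing: not enough history.
            return np.nan
    return handler

# Hypothetical demo data: 5 weeks at 15-minute intervals, a = 0, 1, 2, ...
idx = pd.date_range("2017-01-01 09:00", periods=5 * 7 * 96, freq="15min")
demo = pd.DataFrame({"a": np.arange(len(idx))}, index=idx)

# Mean of a at the same time of day 1, 2, 3 and 4 weeks earlier.
weekly = make_handler(demo, "a", step=pd.Timedelta(weeks=1), lags=range(1, 5))
prev_mean = demo.apply(weekly, axis=1)
```

Passing lags=range(1, 5) leaves the current row out of the average, so prev_mean can be compared directly with demo["a"].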

It's fairly slow, so I'll try to think of a faster way but for now I think this is it.
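One possible faster route is to skip the row-by-row apply and look up each lag for the whole column at once with reindex. Treat this as a sketch rather than a tested solution for your data: lagged_mean and the demo frame are hypothetical names, and it assumes a unique DatetimeIndex on a regular grid so that the shifted timestamps actually exist.

```python
import numpy as np
import pandas as pd

def lagged_mean(s, lags, step):
    """Mean of s at the same clock time k * step earlier, for each k in lags."""
    # reindex looks up s at every shifted timestamp; missing history becomes NaN.
    lagged = [s.reindex(s.index - k * step).to_numpy(dtype=float) for k in lags]
    # np.mean propagates NaN, so rows without full history stay NaN.
    return pd.Series(np.mean(lagged, axis=0), index=s.index)

# Hypothetical demo data: 6 weeks at 15-minute intervals, a = 0, 1, 2, ...
idx = pd.date_range("2017-01-01", periods=6 * 7 * 96, freq="15min")
demo = pd.DataFrame({"a": np.arange(len(idx))}, index=idx)

# Average of a from 1, 2, 3 and 4 weeks earlier, then the difference to the current value.
demo["a_prev_mean"] = lagged_mean(demo["a"], lags=range(1, 5), step=pd.Timedelta(weeks=1))
demo["a_diff"] = demo["a"] - demo["a_prev_mean"]
```

The first 4 weeks come out as NaN, the same as with the try-except version, but each lag is a single vectorized lookup instead of one .loc call per row.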