How to apply an expanding window by group in pandas

Date: 2014-01-08 00:56:11

Tags: python pandas

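Specifically, I want to compute, per group, the expanding mean of the differences between consecutive dates in a series. So, if I have something like this: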

Period    Group    dates
  1         A      2010-07-01
  2         A      2010-07-13
  3         A      2010-07-13
  4         A      2010-07-21
  1         B      2000-08-20
  2         B      2000-08-15

I would get:

Period    Group    cumulative average of differences
  1         A        0
  2         A        12/2
  3         A        12/3
  4         A        20/4
  1         B        0
  2         B       -5/2
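To make the arithmetic behind these values explicit, here is a minimal plain-Python sketch. The helper name `expanding_mean_of_gaps` is made up for illustration, and the sketch assumes the first row of each group contributes a gap of 0, as the table above implies.

```python
from datetime import date


def expanding_mean_of_gaps(dates):
    """Return the running mean of the day gaps between consecutive dates."""
    means = []
    total = 0
    previous = dates[0]
    for count, current in enumerate(dates, start=1):
        # The first row is compared with itself, so it contributes a gap of 0
        total += (current - previous).days
        means.append(total / count)
        previous = current
    return means


group_a = [date(2010, 7, 1), date(2010, 7, 13), date(2010, 7, 13), date(2010, 7, 21)]
group_b = [date(2000, 8, 20), date(2000, 8, 15)]

print(expanding_mean_of_gaps(group_a))  # [0.0, 6.0, 4.0, 5.0]
print(expanding_mean_of_gaps(group_b))  # [0.0, -2.5]
```

For group A the gaps are 0, 12, 0 and 8 days, so the running means are 0/1, 12/2, 12/3 and 20/4.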

2 answers:

Answer 0 (score: 3)

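import io

import pandas as pd

data = """Period    Group    dates
1         A      2010-07-01
2         A      2010-07-13
3         A      2010-07-13
4         A      2010-07-21
1         B      2000-08-20
2         B      2000-08-15"""

# Parse the whitespace-separated text and read the "dates" column as datetimes
df = pd.read_csv(io.StringIO(data), sep=r"\s+", parse_dates=["dates"])

def f(s):
    # Gap to the previous row in days; the first row of a group has no predecessor, so use 0
    days = s.diff().fillna(pd.Timedelta(0)) / pd.Timedelta(days=1)
    # Expanding (cumulative) mean of the gaps, converted back to timedeltas
    return pd.to_timedelta(days.expanding().mean(), unit="D")

# group_keys=False keeps the original row index on the result
r = df.groupby("Group", group_keys=False)["dates"].apply(f)
print(r)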

Output:

0            00:00:00
1    6 days, 00:00:00
2    4 days, 00:00:00
3    5 days, 00:00:00
4            00:00:00
5   -2 days, 12:00:00
dtype: timedelta64[ns]
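In current pandas the same result can probably be reached without a custom function, by combining `groupby(...).diff()` with `groupby(...).expanding()`. The following is a sketch under that assumption, not part of the original answer. The column names `gap_days` and `mean_gap_days` are invented for illustration, and the result is kept as a float number of days rather than a timedelta.

```python
import pandas as pd

df = pd.DataFrame({
    "Group": ["A", "A", "A", "A", "B", "B"],
    "dates": pd.to_datetime([
        "2010-07-01", "2010-07-13", "2010-07-13", "2010-07-21",
        "2000-08-20", "2000-08-15",
    ]),
})

# Gap to the previous row within each group, in whole days (0 for a group's first row)
df["gap_days"] = (
    df.groupby("Group")["dates"].diff().fillna(pd.Timedelta(0)).dt.days
)

# Expanding mean within each group; the result is indexed by (Group, original index),
# so drop the group level to line it up with df again
expanding = df.groupby("Group")["gap_days"].expanding().mean()
df["mean_gap_days"] = expanding.droplevel(0)

print(df)
```

Dropping the group level relies on the original index being unique, which holds for the default integer index used here.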

Answer 1 (score: 1)

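I have an alternative solution. It is slightly longer than the one posted earlier, but I think it makes it easier to see what happens inside the transformation function for the date column, and the output format is also a bit cleaner: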

import numpy as np
import pandas as pd
from datetime import date

# Build data
prd = [1, 2, 3, 4, 1, 2]
grp = ['A', 'A', 'A', 'A', 'B', 'B']
yr =  [2010, 2010, 2010, 2010, 2000, 2000]
mth = [7, 7, 7, 7, 8, 8]
day = [1, 13, 13, 21, 20, 15]
dt = [date(y, m, d) for y, m, d in zip(yr, mth, day)]
# Create data frame
df = pd.DataFrame({'Period': prd, 'Group': grp, 'Dates': dt},
                  columns=['Period', 'Group', 'Dates'])

# Transformation function for the date column
def f(ser):
    v = ser.values
    # Get time difference in days
    delta = [float((ii-v[0]).days) for ii in v]
    # Get number of items to divide by
    dv = np.arange(len(delta))+1
    # Get cumulative average
    cumavg = [nm/dm for nm, dm in zip(delta, dv)]
    # Create output pandas Series object and return it
    out = pd.Series(cumavg, index=ser.index)
    return out

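# Apply the transformation function to the Dates column; group_keys=False keeps
# the original row index so the merge below can align on it
dfappend = pd.DataFrame({'Cum_Avg': df.groupby("Group", group_keys=False).Dates.apply(f)})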
# Delete the Dates column
del df['Dates']
# Merge to create the revised data frame
df = pd.merge(df, dfappend, left_index=True, right_index=True)
print(df)

The output is:

   Period Group  Cum_Avg
0       1     A      0.0
1       2     A      6.0
2       3     A      4.0
3       4     A      5.0
4       1     B      0.0
5       2     B     -2.5
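The function above divides the days elapsed since the group's first date by the row's position in the group. That equals the expanding mean of the gaps, because the summed gaps telescope to "current date minus first date". A vectorized sketch of the same idea follows. It assumes the `Dates` column is stored as `datetime64` rather than `datetime.date` objects, and that rows are already in the desired order within each group.

```python
import pandas as pd

df = pd.DataFrame({
    "Period": [1, 2, 3, 4, 1, 2],
    "Group": ["A", "A", "A", "A", "B", "B"],
    "Dates": pd.to_datetime([
        "2010-07-01", "2010-07-13", "2010-07-13", "2010-07-21",
        "2000-08-20", "2000-08-15",
    ]),
})

# Days elapsed since the first date of each group
first_date = df.groupby("Group")["Dates"].transform("first")
elapsed_days = (df["Dates"] - first_date).dt.days

# 1-based position of each row within its group
position = df.groupby("Group").cumcount() + 1

df["Cum_Avg"] = elapsed_days / position
print(df[["Period", "Group", "Cum_Avg"]])
```

This avoids both the Python-level loop and the merge step, at the cost of being less explicit about each intermediate value.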