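Specifically, I want to compute, per group, the expanding mean of the differences between consecutive dates in a series. So if I have something like this: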
Period  Group  dates
1       A      2010-07-01
2       A      2010-07-13
3       A      2010-07-13
4       A      2010-07-21
1       B      2000-08-20
2       B      2000-08-15
I would get:
Period  Group  cumulative average of differences
1       A      0
2       A      12/2
3       A      12/3
4       A      20/4
1       B      0
2       B      -5/2
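To make the expected numbers concrete, here is a minimal plain-Python sketch of the arithmetic. It assumes the first row of each group contributes a difference of 0 and that differences are measured in whole days; the helper name `expanding_mean_of_diffs` is hypothetical and only used for illustration.

```python
from datetime import date

def expanding_mean_of_diffs(dates):
    """Running mean of consecutive day differences (first difference taken as 0)."""
    out = []
    total = 0          # running sum of consecutive differences, in days
    prev = dates[0]
    for n, d in enumerate(dates, start=1):
        total += (d - prev).days
        out.append(total / n)   # divide by the number of rows seen so far
        prev = d
    return out

group_a = [date(2010, 7, 1), date(2010, 7, 13), date(2010, 7, 13), date(2010, 7, 21)]
group_b = [date(2000, 8, 20), date(2000, 8, 15)]

result_a = expanding_mean_of_diffs(group_a)  # 0/1, 12/2, 12/3, 20/4
result_b = expanding_mean_of_diffs(group_b)  # 0/1, -5/2
print(result_a, result_b)
```

The running sums are 0, 12, 12, 20 for group A and 0, -5 for group B, which is where the numerators in the table above come from.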
Answer 0 (score: 3)
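import io
import pandas as pd

data = """Period Group dates
1 A 2010-07-01
2 A 2010-07-13
3 A 2010-07-13
4 A 2010-07-21
1 B 2000-08-20
2 B 2000-08-15"""
# Read the whitespace-separated text and parse the "dates" column as datetimes
df = pd.read_csv(io.StringIO(data), sep=r"\s+", parse_dates=["dates"])

def f(s):
    # Differences between consecutive dates; the first row of a group has none, so use 0
    t = s.diff().fillna(pd.Timedelta(0))
    # Expanding mean computed on seconds, then converted back to timedeltas
    return pd.to_timedelta(t.dt.total_seconds().expanding().mean(), unit="s")

# group_keys=False keeps the original row index on the result
r = df.groupby("Group", group_keys=False)["dates"].apply(f)
print(r)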
Output (newer pandas versions print timedeltas in a slightly different format):
0 00:00:00
1 6 days, 00:00:00
2 4 days, 00:00:00
3 5 days, 00:00:00
4 00:00:00
5 -2 days, 12:00:00
dtype: timedelta64[ns]
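If plain numbers are preferred over timedeltas, one option is to divide by a one-day `Timedelta`. The sketch below is only an illustration: it rebuilds the six values from the output above by hand (an assumption, so the snippet runs on its own) instead of recomputing them.

```python
import pandas as pd

# Hand-built copy of the timedelta result shown above (values in days)
r = pd.Series([pd.Timedelta(days=d) for d in [0, 6, 4, 5, 0, -2.5]])

# Dividing a timedelta Series by a one-day Timedelta yields float days
days = r / pd.Timedelta(days=1)
print(days)
```

The last value comes out as -2.5, which is what "-2 days, 12:00:00" in the output above represents.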
Answer 1 (score: 1)
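I have an alternative solution. It is slightly longer than the one posted earlier, but I think it makes it easier to see what happens inside the function that transforms the date column, and the output format is a bit cleaner too: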
import numpy as np
import pandas as pd
from datetime import date
# Build data
prd = [1, 2, 3, 4, 1, 2]
grp = ['A', 'A', 'A', 'A', 'B', 'B']
yr = [2010, 2010, 2010, 2010, 2000, 2000]
mth = [7, 7, 7, 7, 8, 8]
day = [1, 13, 13, 21, 20, 15]
dt = [date(y, m, d) for y, m, d in zip(yr, mth, day)]
# Create data frame
df = pd.DataFrame({'Period': prd, 'Group': grp, 'Dates': dt},
                  columns=['Period', 'Group', 'Dates'])
# Transformation function for the date column
def f(ser):
    v = ser.values
    # Get time difference in days
    delta = [float((ii-v[0]).days) for ii in v]
    # Get number of items to divide by
    dv = np.arange(len(delta))+1
    # Get cumulative average
    cumavg = [nm/dm for nm, dm in zip(delta, dv)]
    # Create output pandas Series object and return it
    out = pd.Series(cumavg, index=ser.index)
    return out
# Apply the transformation function to the Dates column
# (group_keys=False keeps the original row index so the merge below lines up)
dfappend = pd.DataFrame({'Cum_Avg': df.groupby("Group", group_keys=False).Dates.apply(f)})
# Delete the Dates column
del df['Dates']
# Merge to create the revised data frame
df = pd.merge(df, dfappend, left_index=True, right_index=True)
print(df)
The output is:
   Period Group  Cum_Avg
0       1     A      0.0
1       2     A      6.0
2       3     A      4.0
3       4     A      5.0
4       1     B      0.0
5       2     B     -2.5
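The same numbers can also be obtained without a custom function. The sum of consecutive differences up to a row equals the time elapsed since the group's first date, so dividing that by the row's position in its group gives the cumulative average. This is a sketch rather than a drop-in replacement. It assumes the rows are already in the desired order within each group and that the dates are stored as pandas datetimes (not `datetime.date` objects).

```python
import pandas as pd

df = pd.DataFrame({
    "Period": [1, 2, 3, 4, 1, 2],
    "Group": ["A", "A", "A", "A", "B", "B"],
    "Dates": pd.to_datetime(["2010-07-01", "2010-07-13", "2010-07-13",
                             "2010-07-21", "2000-08-20", "2000-08-15"]),
})

# Days elapsed since the first date of each group
# (equal to the running sum of consecutive differences)
first_date = df.groupby("Group")["Dates"].transform("first")
elapsed = (df["Dates"] - first_date).dt.days

# 1-based position of each row inside its group
position = df.groupby("Group").cumcount() + 1

df["Cum_Avg"] = elapsed / position
print(df[["Period", "Group", "Cum_Avg"]])
```

Because everything is computed on whole columns, no merge step is needed; the new column is aligned with the original index automatically.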