I have the following DataFrame:
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'ID': ['27459', '27459', '27459', '27459', '27459', '27459', '27459', '48002', '48002', '48002'],
    'Invoice_Date': ['2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14', '2020-07-25',
                     '2020-07-30', '2020-08-02', '2020-05-13', '2020-06-20', '2020-06-28'],
    'Payment_Term': [7, 8, 3, 6, 4, 7, 8, 5, 3, 6],
    'Payment_Date': ['2020-07-05', '2020-07-05', '2020-07-03', '2020-07-21', '2020-07-31',
                     '2020-08-15', '2020-08-22', '2020-06-16', '2020-06-23', '2020-07-05'],
})
df = pd.DataFrame(data, columns = ['ID', 'Invoice_Date', 'Payment_Term', 'Payment_Date'])
df['Invoice_Date'] = pd.to_datetime(df['Invoice_Date'].astype(str), format='%Y-%m-%d')
df['Payment_Date'] = pd.to_datetime(df['Payment_Date'].astype(str), format='%Y-%m-%d')
df['Due_Date'] = df['Invoice_Date'] + pd.to_timedelta(df['Payment_Term'], unit = 'd')
df['Delay'] = df['Payment_Date'] - df['Due_Date']
df['Delay'] = df['Delay'].dt.days
df['diff'] = df.groupby('ID')['Invoice_Date'].diff() / np.timedelta64(1, 'D')
df['diff'] = df['diff'].fillna(0)
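def func(x):
    # x holds the day gaps between consecutive invoices of one ID
    x = x.values
    values = [x[0]]
    for i in range(1, len(x)):
        value = values[i-1] + x[i]
        if value < 30:
            # still inside the current 30-day window: keep accumulating
            values.append(value)
        elif x[i] >= 30:
            # the gap alone is 30 days or more: restart the count at 0
            values.append(0)
        else:
            # running total reached 30: restart the count from this gap
            values.append(x[i])
    return values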
df['days'] = df.groupby("ID")["diff"].transform(func)
df
Out[1]:
      ID Invoice_Date  Payment_Term Payment_Date   Due_Date  Delay  diff  days
0  27459   2020-06-26             7   2020-07-05 2020-07-03      2   0.0   0.0
1  27459   2020-06-29             8   2020-07-05 2020-07-07     -2   3.0   3.0
2  27459   2020-06-30             3   2020-07-03 2020-07-03      0   1.0   4.0
3  27459   2020-07-14             6   2020-07-21 2020-07-20      1  14.0  18.0
4  27459   2020-07-25             4   2020-07-31 2020-07-29      2  11.0  29.0
5  27459   2020-07-30             7   2020-08-15 2020-08-06      9   5.0   5.0
6  27459   2020-08-02             8   2020-08-22 2020-08-10     12   3.0   8.0
7  48002   2020-05-13             5   2020-06-16 2020-05-18     29   0.0   0.0
8  48002   2020-06-20             3   2020-06-23 2020-06-23      0  38.0   0.0
9  48002   2020-06-28             6   2020-07-05 2020-07-04      1   8.0   8.0
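I want to create a column Mean, calculated as the sum of Delay divided by the number of invoices within a 30-day period, per ID.
For example, the first Invoice_Date for ID 27459 is 2020-06-26, so the 30-day period runs through 2020-07-25, and the mean is computed from the Delay values that fall within that period.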
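To make the window arithmetic concrete, here is a minimal sketch for ID 27459 only, using the dates and Delay values from the table above. It assumes the 30-day period counts the first invoice date as day 1, which is what makes it end on 2020-07-25; the names `invoices`, `window_start`, `window_end`, `in_window`, and `first_window_mean` are made up for this illustration.

```python
import pandas as pd

# Invoices for ID 27459, copied from the question's data.
invoices = pd.DataFrame({
    'Invoice_Date': pd.to_datetime(['2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14',
                                    '2020-07-25', '2020-07-30', '2020-08-02']),
    'Delay': [2, -2, 0, 1, 2, 9, 12],
})

# Assumption: the first invoice date is day 1 of the window,
# so the window ends 29 days after it.
window_start = invoices['Invoice_Date'].min()
window_end = window_start + pd.Timedelta(days=29)

# between() is inclusive on both ends by default.
in_window = invoices['Invoice_Date'].between(window_start, window_end)
first_window_mean = invoices.loc[in_window, 'Delay'].mean()

print(window_end.date())   # 2020-07-25
print(first_window_mean)   # 0.6
```

The five invoices up to 2020-07-25 fall in the first window, giving (2 - 2 + 0 + 1 + 2) / 5 = 0.6; the two later invoices belong to a second window.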
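The tricky part is that there are actually two means within a single ID. I tried groupby.mean, but that only works when I need the mean over the whole ID group.
The expected output should look roughly like this:
[expected output not preserved in the source]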
Answer 0 (score: 0)
Just a suggestion. After
df['Delay'] = df['Delay'].dt.days
try:
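# day gap between consecutive invoices of the same ID (0 for the first one)
df['_m_grp'] = (df.groupby('ID')['Invoice_Date'].diff()
                / np.timedelta64(1, 'D')).fillna(0)
# days since the ID's first invoice, floor-divided into 30-day buckets
df['_m_grp'] = df.groupby('ID')['_m_grp'].cumsum() // 30
# mean Delay per ID and bucket, broadcast back to every row
df['Mean'] = df.groupby(by=['ID', '_m_grp'])['Delay'].transform('mean')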
The result looks like this (some columns removed):
      ID Invoice_Date  Delay  Mean
0  27459   2020-06-26      2   0.6
1  27459   2020-06-29     -2   0.6
2  27459   2020-06-30      0   0.6
3  27459   2020-07-14      1   0.6
4  27459   2020-07-25      2   0.6
5  27459   2020-07-30      9  10.5
6  27459   2020-08-02     12  10.5
7  48002   2020-05-13     29  29.0
8  48002   2020-06-20      0   0.5
9  48002   2020-06-28      1   0.5
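The bucketing idea can also be wrapped in a small helper. The sketch below is not the answer's exact code: it computes the days since each ID's first invoice directly instead of cumulating the gaps, which gives the same buckets when the rows are sorted by date within each ID. The names `add_window_mean`, `window_days`, `sample`, and `result` are hypothetical.

```python
import pandas as pd

def add_window_mean(frame, window_days=30):
    """Return a copy with 'Mean': the mean Delay per ID and fixed-length window."""
    out = frame.copy()
    # Days elapsed since the first invoice of each ID.
    first_invoice = out.groupby('ID')['Invoice_Date'].transform('min')
    elapsed_days = (out['Invoice_Date'] - first_invoice).dt.days
    # Floor-divide into consecutive windows: 0-29 -> 0, 30-59 -> 1, ...
    out['_m_grp'] = elapsed_days // window_days
    # Broadcast the per-window mean back to every row of that window.
    out['Mean'] = out.groupby(['ID', '_m_grp'])['Delay'].transform('mean')
    return out

# Columns copied from the result table above.
sample = pd.DataFrame({
    'ID': ['27459'] * 7 + ['48002'] * 3,
    'Invoice_Date': pd.to_datetime(['2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14',
                                    '2020-07-25', '2020-07-30', '2020-08-02', '2020-05-13',
                                    '2020-06-20', '2020-06-28']),
    'Delay': [2, -2, 0, 1, 2, 9, 12, 29, 0, 1],
})

result = add_window_mean(sample)
print(result[['ID', 'Invoice_Date', 'Delay', 'Mean']])
```

Note that these windows are fixed 30-day bins anchored at each ID's first invoice. A new window does not restart at the first invoice that follows the previous window; for ID 48002, for instance, the second bin covers days 30 to 59 after 2020-05-13.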
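If you really want the selective fill from your approach, try:
df['_m_grp'] = (df.groupby('ID')['Invoice_Date'].diff()
                / np.timedelta64(1, 'D')).fillna(0)
df['_m_grp'] = df.groupby('ID')['_m_grp'].cumsum() // 30
# -1 marks the last row before the bucket increases by one, and the last row of each ID
df['_m_fill'] = df.groupby('ID')['_m_grp'].diff(-1).fillna(-1.)
# keep the mean only on the marked rows; all other rows become NaN
df['Mean'] = (df.groupby(by=['ID', '_m_grp'])['Delay']
              .transform('mean')[df['_m_fill'] == -1.])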
Result:
      ID Invoice_Date  Delay  Mean
0  27459   2020-06-26      2   NaN
1  27459   2020-06-29     -2   NaN
2  27459   2020-06-30      0   NaN
3  27459   2020-07-14      1   NaN
4  27459   2020-07-25      2   0.6
5  27459   2020-07-30      9   NaN
6  27459   2020-08-02     12  10.5
7  48002   2020-05-13     29  29.0
8  48002   2020-06-20      0   NaN
9  48002   2020-06-28      1   0.5
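As a possible variation on the selective fill, the last row of each (ID, bucket) pair can be found with `DataFrame.duplicated(keep='last')` and the other rows blanked with `Series.where`. This is a swapped-in technique, not the answer's code; the bucket numbers are hard-coded from the steps above, and `df_demo`, `window_mean`, and `is_last` are hypothetical names.

```python
import pandas as pd

# IDs, 30-day bucket numbers, and delays, as produced by the steps above.
df_demo = pd.DataFrame({
    'ID': ['27459'] * 7 + ['48002'] * 3,
    '_m_grp': [0, 0, 0, 0, 0, 1, 1, 0, 1, 1],
    'Delay': [2, -2, 0, 1, 2, 9, 12, 29, 0, 1],
})

# Mean Delay per (ID, bucket), broadcast to every row.
window_mean = df_demo.groupby(['ID', '_m_grp'])['Delay'].transform('mean')

# duplicated(keep='last') is True for every row except the last one
# of each (ID, bucket) pair, so negating it flags the last rows.
is_last = ~df_demo.duplicated(subset=['ID', '_m_grp'], keep='last')

# Keep the mean on the last row of each window; NaN elsewhere.
df_demo['Mean'] = window_mean.where(is_last)
print(df_demo)
```

One reason to consider this: the `diff(-1) == -1` test only flags a row when the next bucket number is exactly one higher (or when the row is the last of its ID). If an ID had no invoices at all in some 30-day bucket, the jump would be larger than one and the row before the gap would not be flagged, whereas the `duplicated` check does not depend on the bucket numbering.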