Calculate mean based on a condition in a pandas column

Time: 2020-11-11 15:29:28

Tags: python pandas function dataframe

I have the dataframe below:

import numpy as np
import pandas as pd

data = pd.DataFrame({
        'ID':  ['27459', '27459', '27459', '27459', '27459', '27459', '27459', '48002', '48002', '48002'],
        'Invoice_Date': ['2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14', '2020-07-25', 
                         '2020-07-30', '2020-08-02', '2020-05-13', '2020-06-20', '2020-06-28'],
        'Payment_Term': [7,8,3,6,4,7,8,5,3,6],
        'Payment_Date': ['2020-07-05', '2020-07-05','2020-07-03', '2020-07-21', '2020-07-31', 
                         '2020-08-15', '2020-08-22', '2020-06-16', '2020-06-23', '2020-07-05'],
        })

df = pd.DataFrame(data, columns = ['ID', 'Invoice_Date', 'Payment_Term', 'Payment_Date'])

df['Invoice_Date'] = pd.to_datetime(df['Invoice_Date'].astype(str), format='%Y-%m-%d')
df['Payment_Date'] = pd.to_datetime(df['Payment_Date'].astype(str), format='%Y-%m-%d')
df['Due_Date'] = df['Invoice_Date'] + pd.to_timedelta(df['Payment_Term'], unit = 'd') 
df['Delay'] = df['Payment_Date'] - df['Due_Date']
df['Delay'] = df['Delay'].dt.days                                                
df['diff'] = df.groupby('ID')['Invoice_Date'].diff() / np.timedelta64(1, 'D')
df['diff'] = df['diff'].fillna(0)

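def func(x):
    # Running total of the day gaps that restarts once it would reach 30
    x = x.values
    values = [x[0]]
    for i in range(1, len(x)):
        value = values[i-1] + x[i]
        if value < 30:
            values.append(value)   # still inside the current 30-day period
        elif x[i] >= 30:
            values.append(0)       # the gap alone spans 30+ days: restart at 0
        else:
            values.append(x[i])    # total overflowed: restart from this gap
    return values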


df['days'] = df.groupby("ID")["diff"].transform(func)
df

Out[1]:

      ID Invoice_Date  Payment_Term Payment_Date   Due_Date  Delay  diff  days
0  27459   2020-06-26             7   2020-07-05 2020-07-03      2   0.0   0.0
1  27459   2020-06-29             8   2020-07-05 2020-07-07     -2   3.0   3.0
2  27459   2020-06-30             3   2020-07-03 2020-07-03      0   1.0   4.0
3  27459   2020-07-14             6   2020-07-21 2020-07-20      1  14.0  18.0
4  27459   2020-07-25             4   2020-07-31 2020-07-29      2  11.0  29.0
5  27459   2020-07-30             7   2020-08-15 2020-08-06      9   5.0   5.0
6  27459   2020-08-02             8   2020-08-22 2020-08-10     12   3.0   8.0
7  48002   2020-05-13             5   2020-06-16 2020-05-18     29   0.0   0.0
8  48002   2020-06-20             3   2020-06-23 2020-06-23      0  38.0   0.0
9  48002   2020-06-28             6   2020-07-05 2020-07-04      1   8.0   8.0

I want to create a column Mean, calculated as the sum of Delay divided by the number of invoices within 30 days, per ID.

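For example, the first Invoice_Date for ID 27459 is 2020-06-26, so the 30-day period runs through 2020-07-25, and the mean for that period is calculated from the Delay values of the invoices in it.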

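The tricky part is that there can actually be two means within a single ID. I tried groupby.mean, but that only works when I need the mean over the whole ID group.

The expected output should look roughly like this:

[expected output table not preserved in the source]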
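To illustrate the limitation described in the question, here is a minimal sketch of a plain per-ID mean. The column name `Mean_per_ID` is made up for this illustration, and the Delay values are copied from the table above.

```python
import pandas as pd

# ID and Delay values copied from the table in the question
df = pd.DataFrame({
    'ID': ['27459'] * 7 + ['48002'] * 3,
    'Delay': [2, -2, 0, 1, 2, 9, 12, 29, 0, 1],
})

# One mean per ID: every row of an ID receives the same value, so the
# separate 30-day periods inside an ID are not distinguished.
# 'Mean_per_ID' is a hypothetical column name used only here.
df['Mean_per_ID'] = df.groupby('ID')['Delay'].transform('mean')
print(df)
```

Each ID ends up with a single value (24/7 for 27459 and 10.0 for 48002), which is why an extra grouping key for the 30-day period is needed.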

1 answer:

Answer 0 (score: 0)

Just a suggestion.

After

df['Delay'] = df['Delay'].dt.days

try

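# Days between consecutive invoices within each ID (0 for the first invoice)
df['_m_grp'] = (df.groupby('ID')['Invoice_Date'].diff()
                / np.timedelta64(1, 'D')).fillna(0)
# Days since the ID's first invoice, floor-divided into 30-day bins
df['_m_grp'] = df.groupby('ID')['_m_grp'].cumsum() // 30
# Mean Delay per (ID, 30-day bin), broadcast back to every row
df['Mean'] = df.groupby(by=['ID', '_m_grp'])['Delay'].transform('mean')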

The result looks like this (some columns removed):

      ID Invoice_Date  Delay  Mean
0  27459   2020-06-26      2   0.6
1  27459   2020-06-29     -2   0.6
2  27459   2020-06-30      0   0.6
3  27459   2020-07-14      1   0.6
4  27459   2020-07-25      2   0.6
5  27459   2020-07-30      9  10.5
6  27459   2020-08-02     12  10.5
7  48002   2020-05-13     29  29.0
8  48002   2020-06-20      0   0.5
9  48002   2020-06-28      1   0.5
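
A side note on what `cumsum() // 30` does: it labels invoices by fixed 30-day bins counted from each ID's first invoice. If a new period is instead meant to start at the first invoice that falls after the previous period has ended, the labels can differ. The sketch below compares the two rules on plain lists. `fixed_bins` and `anchored_windows` are hypothetical helper names, the second rule is only an assumed reading of the question, and the input is assumed to be sorted in ascending order.

```python
def fixed_bins(days_since_first):
    """Label each invoice by its 30-day bin counted from the first invoice."""
    return [int(d // 30) for d in days_since_first]


def anchored_windows(days_since_first):
    """Start a new window at the first invoice that falls 30 or more days
    after the start of the current window (assumed alternative rule)."""
    labels = []
    start = None      # day on which the current window began
    current = -1      # label of the current window
    for d in days_since_first:
        if start is None or d - start >= 30:
            start = d
            current += 1
        labels.append(current)
    return labels


# Days elapsed since the first invoice for ID 27459 in the question
days_27459 = [0, 3, 4, 18, 29, 34, 37]
print(fixed_bins(days_27459))          # [0, 0, 0, 0, 0, 1, 1]
print(anchored_windows(days_27459))    # [0, 0, 0, 0, 0, 1, 1]

# A made-up sequence on which the two rules disagree
days_example = [0, 40, 65]
print(fixed_bins(days_example))        # [0, 1, 2]
print(anchored_windows(days_example))  # [0, 1, 1]
```

On the data in the question both rules give the same groups, so the means shown above are unaffected; the difference only matters when an ID has long gaps between invoices.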

If you really want to fill Mean selectively this way, try:

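# Same 30-day bin label as before
df['_m_grp'] = (df.groupby('ID')['Invoice_Date'].diff()
                / np.timedelta64(1, 'D')).fillna(0)
df['_m_grp'] = df.groupby('ID')['_m_grp'].cumsum() // 30
# -1 where the next row of the same ID is in the next bin, or there is no next row
df['_m_fill'] = df.groupby('ID')['_m_grp'].diff(-1).fillna(-1.)
# Keep the bin mean only on those rows; the other rows become NaN on assignment
df['Mean'] = (df.groupby(by=['ID', '_m_grp'])['Delay']
              .transform('mean')[df['_m_fill'] == -1.])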

Result:

      ID Invoice_Date  Delay  Mean
0  27459   2020-06-26      2   NaN
1  27459   2020-06-29     -2   NaN
2  27459   2020-06-30      0   NaN
3  27459   2020-07-14      1   NaN
4  27459   2020-07-25      2   0.6
5  27459   2020-07-30      9   NaN
6  27459   2020-08-02     12  10.5
7  48002   2020-05-13     29  29.0
8  48002   2020-06-20      0   NaN
9  48002   2020-06-28      1   0.5
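
One caveat with comparing `_m_fill` to -1: if two consecutive invoices of an ID are more than one bin apart (say bin 0 followed by bin 2), the difference is -2 and the last row of the earlier bin is not marked. A possible variant, sketched here under the assumption that rows are sorted by Invoice_Date within each ID, marks the last row of every (ID, bin) pair with `DataFrame.duplicated` instead. The names `first`, `group_mean` and `is_last` are introduced only for this illustration.

```python
import pandas as pd

# Columns copied from the question; rows are assumed sorted by date within each ID
df = pd.DataFrame({
    'ID': ['27459'] * 7 + ['48002'] * 3,
    'Invoice_Date': pd.to_datetime([
        '2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14', '2020-07-25',
        '2020-07-30', '2020-08-02', '2020-05-13', '2020-06-20', '2020-06-28']),
    'Delay': [2, -2, 0, 1, 2, 9, 12, 29, 0, 1],
})

# Days since each ID's first invoice, floor-divided into 30-day bins
first = df.groupby('ID')['Invoice_Date'].transform('min')
df['_m_grp'] = (df['Invoice_Date'] - first).dt.days // 30

# Mean Delay per (ID, bin), broadcast to every row
group_mean = df.groupby(['ID', '_m_grp'])['Delay'].transform('mean')

# True only on the last row of each (ID, bin) pair, however far apart the bins are
is_last = ~df.duplicated(subset=['ID', '_m_grp'], keep='last')

# Keep the mean on the last row of each bin and NaN elsewhere
df['Mean'] = group_mean.where(is_last)
print(df[['ID', 'Invoice_Date', 'Delay', 'Mean']])
```

On the data in the question this reproduces the table above: Mean is filled only on rows 4, 6, 7 and 9.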