Question

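I'd like to replace the existing values in a column with the average of that same column, preferably using Python. I want to spread the payments evenly across every month from the first payment month to the last. The average monthly payment should be computed per cust_id and sub_id.

Payments may skip months, and the amounts are not always the same.

I hope you can help, as I've only just started learning Python.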

The data looks like this:

cust_id	sub_id	date	payment
1	A	12/1/20	200
1	A	2/2/21	200
1	A	2/3/21	100
1	A	5/1/21	200
1	B	1/2/21	50
1	B	1/9/21	20
1	B	3/1/21	80
1	B	4/23/21	90
2	C	1/4/21	200
2	C	1/9/21	300

The result I want is:

cust_id	sub_id	date	payment
1	A	12/1/20	116.67
1	A	1/1/21	116.67
1	A	2/1/21	116.67
1	A	3/1/21	116.67
1	A	4/1/21	116.67
1	A	5/1/21	116.67
1	B	1/1/21	60
1	B	2/1/21	60
1	B	3/1/21	60
1	B	4/1/21	60
2	C	1/1/21	500

Thank you very much!

Answer 1

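As noted in the comments, your expected result for cust_id=2 and sub_id='C' seems inconsistent with your stated requirement, so I went with the requirement.

First, we aggregate the dates to their min and max, and the payments to their sum: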
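To make the snippets in this answer reproducible, here is a minimal sketch that builds the question's sample data. It assumes the columns are named `date` and `payment` (as the code in this answer does) and that the dates are month/day/two-digit-year strings:

```python
import pandas as pd

# Sample data copied from the question. The column names 'date' and
# 'payment' are an assumption chosen to match the answer's code.
df = pd.DataFrame({
    'cust_id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
    'sub_id':  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'date':    ['12/1/20', '2/2/21', '2/3/21', '5/1/21',
                '1/2/21', '1/9/21', '3/1/21', '4/23/21',
                '1/4/21', '1/9/21'],
    'payment': [200, 200, 100, 200, 50, 20, 80, 90, 200, 300],
})

# Parse the strings into real timestamps so min/max and date ranges work.
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')
```

If your dates come from a CSV, parsing them once up front like this is usually simpler than converting inside each step.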

# assumes pandas is imported as pd and df['date'] is already datetime
df2 = df.groupby(['cust_id','sub_id']).agg({'date':['min','max'], 'payment':'sum'})
df2.columns = df2.columns.get_level_values(1)
df2

which gives

        min         max         sum
cust_id sub_id          
1   A   2020-12-01  2021-05-01  700
    B   2021-01-02  2021-04-23  240
2   C   2021-01-04  2021-01-09  500

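Then we create a monthly schedule for each row, running from min to max. You may need to fiddle with the dates a bit to get them lined up the way you want; I've only done the basics here to show the idea: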

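from datetime import timedelta
# 'M' means month-end; pandas 2.2+ deprecates it in favour of the alias 'ME'
# padding max by 31 days makes sure the last payment month is included
df2['schedule'] = df2.apply(lambda row: pd.date_range(row['min'], row['max'] + timedelta(days=31), freq='1M'), axis=1)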

Now df2 looks like this:


          min                  max                    sum  schedule
--------  -------------------  -------------------  -----  ---------------------------------------------------------------------------------------------------------
(1, 'A')  2020-12-01 00:00:00  2021-05-01 00:00:00    700  DatetimeIndex(['2020-12-31', '2021-01-31', '2021-02-28', '2021-03-31',
                                                                          '2021-04-30', '2021-05-31'],
                                                                         dtype='datetime64[ns]', freq='M')
(1, 'B')  2021-01-02 00:00:00  2021-04-23 00:00:00    240  DatetimeIndex(['2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30'], dtype='datetime64[ns]', freq='M')
(2, 'C')  2021-01-04 00:00:00  2021-01-09 00:00:00    500  DatetimeIndex(['2021-01-31'], dtype='datetime64[ns]', freq='M')
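If you would rather get the first day of each month (as in the question's desired output) and skip the timedelta padding, one option is to count calendar months with `pd.period_range`. This is only a sketch: the helper name `month_starts` is my own, and the dates are hard-coded from the min/max table above.

```python
import pandas as pd

def month_starts(first, last):
    """First day of every calendar month from `first` to `last`, inclusive.

    Hypothetical helper, not part of pandas. Monthly periods ignore the
    day of month, so no padding of the end date is needed.
    """
    return pd.period_range(first, last, freq='M').to_timestamp()

# min/max dates for each (cust_id, sub_id) group, taken from the table above
schedule_a = month_starts('2020-12-01', '2021-05-01')  # group (1, 'A')
schedule_b = month_starts('2021-01-02', '2021-04-23')  # group (1, 'B')
schedule_c = month_starts('2021-01-04', '2021-01-09')  # group (2, 'C')
```

Each result is a DatetimeIndex, so it can stand in for the `pd.date_range(...)` call inside the lambda if month-start dates suit you better.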

Now we explode the schedule, split the payment evenly across it, and tidy up the column names:

df3 = df2.groupby(['cust_id','sub_id'], as_index = False).apply(lambda g: g.explode('schedule'))
(df3.groupby(['cust_id','sub_id'], as_index = False)
    .apply(lambda g: g.assign(sum = g['sum']/len(g)))
    .reset_index(drop = False)
    .drop(columns = ['min','max','level_0'])
    .rename(columns = {'sum':'payment'})
)

which gives

      cust_id  sub_id      payment  schedule
--  ---------  --------  ---------  -------------------
 0          1  A           116.667  2020-12-31 00:00:00
 1          1  A           116.667  2021-01-31 00:00:00
 2          1  A           116.667  2021-02-28 00:00:00
 3          1  A           116.667  2021-03-31 00:00:00
 4          1  A           116.667  2021-04-30 00:00:00
 5          1  A           116.667  2021-05-31 00:00:00
 6          1  B            60      2021-01-31 00:00:00
 7          1  B            60      2021-02-28 00:00:00
 8          1  B            60      2021-03-31 00:00:00
 9          1  B            60      2021-04-30 00:00:00
10          2  C           500      2021-01-31 00:00:00
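The explode-and-split step can also be written without groupby().apply(), which some may find easier to follow. A rough sketch under my own assumptions: `df2` is rebuilt by hand here as a flat frame with one row per group, using the totals from above and month-start dates rather than the month-end dates shown in the output.

```python
import pandas as pd

# One row per (cust_id, sub_id) group: total paid plus the months to cover.
# Values are hard-coded from the tables above for illustration.
df2 = pd.DataFrame({
    'cust_id': [1, 1, 2],
    'sub_id': ['A', 'B', 'C'],
    'sum': [700, 240, 500],
    'schedule': [
        list(pd.date_range('2020-12-01', periods=6, freq='MS')),
        list(pd.date_range('2021-01-01', periods=4, freq='MS')),
        list(pd.date_range('2021-01-01', periods=1, freq='MS')),
    ],
})

# Divide each total by the number of months it has to cover...
df2['payment'] = df2['sum'] / df2['schedule'].map(len)

# ...then give every scheduled month its own row.
df3 = (df2
    .explode('schedule')
    .drop(columns='sum')
    .reset_index(drop=True)
)
```

Doing the division before the explode means each row already carries its share, so no second groupby is needed.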

Answer 2

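This can be done in just a few steps using the resample() and transform() functions:

First, we add the missing months to the original table, change every date to the first day of its month, merge rows that fall in the same month by summing their original payment values, and put 0 in the payment column of the newly added rows: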

# assumes df['date'] holds datetimes (e.g. converted with pd.to_datetime)
resampled_df = (df
   .set_index('date')
   .groupby(['cust_id', 'sub_id'])
   .resample('MS')
   .agg({'payment': 'sum'})
   .reset_index()
)

Then we compute the mean over all months in each group and assign it to every row of that group, storing the result in a new column:

resampled_df['avg_monthly_payment'] = (resampled_df
   .groupby(['cust_id', 'sub_id'])['payment']
   .transform('mean')
)
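For reference, here is a self-contained sketch of both steps run against the question's sample data. The frame construction and the name `monthly` are my own assumptions. I also select the payment column before summing, which should be equivalent to the agg() call above.

```python
import pandas as pd

# Sample data from the question (column names assumed to be 'date'/'payment').
df = pd.DataFrame({
    'cust_id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
    'sub_id':  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'date':    ['12/1/20', '2/2/21', '2/3/21', '5/1/21',
                '1/2/21', '1/9/21', '3/1/21', '4/23/21',
                '1/4/21', '1/9/21'],
    'payment': [200, 200, 100, 200, 50, 20, 80, 90, 200, 300],
})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')

# Step 1: one row per group and month start; skipped months sum to 0.
monthly = (df
    .set_index('date')
    .groupby(['cust_id', 'sub_id'])
    .resample('MS')['payment']
    .sum()
    .reset_index()
)

# Step 2: broadcast each group's mean monthly payment to all of its rows.
monthly['avg_monthly_payment'] = (monthly
    .groupby(['cust_id', 'sub_id'])['payment']
    .transform('mean')
)
```

Group (1, 'A') spans six months with a total of 700, so its average works out to 700 / 6, or about 116.67, matching the question. The original monthly sums stay in `payment`; drop or overwrite that column if you only want the averaged values.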
