I have this DataFrame in pandas:
   day customer  amount
0    1    cust1     500
1    2    cust2     100
2    1    cust1      50
3    2    cust1     100
4    2    cust2     250
5    6    cust1      20
For convenience:
import pandas as pd

df = pd.DataFrame({'day': [1, 2, 1, 2, 2, 6],
                   'customer': ['cust1', 'cust2', 'cust1', 'cust1', 'cust2', 'cust1'],
                   'amount': [500, 100, 50, 100, 250, 20]})
I want to create a new column, "amount2days", that sums each customer's amounts over the last two days, giving the following DataFrame:
   day customer  amount  amount2days
0    1    cust1     500          500   (no past transactions)
1    2    cust2     100          100   (no past transactions)
2    1    cust1      50          550   (500 + 50, rows 0,2)
3    2    cust1     100          650   (500 + 50 + 100, rows 0,2,3)
4    2    cust2     250          350   (100 + 250, rows 1,4)
5    6    cust1      20           20   (notice day is 6, and no day=5 for cust1)
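That is, I want to run the following (pseudo)code:
df['amount2days'] = df_of_past_2_days['amount'].sum()
for each row. What is the most convenient way to do this?
The sum I want is taken over days, but the day does not necessarily increase with every new row, as the example shows. I still want to sum the amounts over the past two days.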
Answer 0 (score: 2)
Use groupby with Series.rolling and sum.
Note: to avoid misaligned data, it is necessary to add Series.reset_index here:
df['amount2days'] = (df.groupby('customer')['amount']
.rolling(2, min_periods=0)
.sum()
.reset_index(level=0, drop=True))
print (df)
day customer amount amount2days
1 1 cust1 500 500.0
2 2 cust1 100 600.0
3 3 cust1 250 350.0
Why not use .to_numpy here? Because if the index is not the default one, the output would be assigned to the wrong rows. Check the following example:
df = pd.DataFrame({'day': {0: 1, 2: 2, 5: 3, 1: 1, 6: 2, 4: 3},
                   'customer': {0: 'cust2', 2: 'cust2', 5: 'cust2',
                                1: 'cust1', 6: 'cust1', 4: 'cust1'},
                   'amount': {0: 5000, 2: 1000, 5: 2500, 1: 500, 6: 100, 4: 250}})
print (df)
day customer amount
0 1 cust2 5000
2 2 cust2 1000
5 3 cust2 2500
1 1 cust1 500
6 2 cust1 100
4 3 cust1 250
df['amount2days'] = (df.groupby('customer', sort=False).amount
.rolling(2, min_periods=0)
.sum()
.to_numpy())
df['amount2days1'] = (df.groupby('customer')['amount']
.rolling(2, min_periods=0)
.sum()
.reset_index(level=0, drop=True))
print (df)
day customer amount amount2days amount2days1
0 1 cust2 5000 500.0 5000.0
2 2 cust2 1000 600.0 6000.0
5 3 cust2 2500 350.0 3500.0
1 1 cust1 500 5000.0 500.0
6 2 cust1 100 6000.0 600.0
4 3 cust1 250 3500.0 350.0
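A smaller sketch of the same effect, using made-up values rather than the data above: assigning a Series aligns on index labels, while assigning the array returned by .to_numpy() goes purely by position. The column names by_index and by_position are only illustrative.

```python
import pandas as pd

# A frame whose index is deliberately not 0, 1, 2 in order
df = pd.DataFrame({'amount': [10, 20, 30]}, index=[2, 0, 1])

# Some result that carries index labels in a different order than df
s = pd.Series([100, 200, 300], index=[0, 1, 2])

# Series assignment matches on labels: label 2 gets 300, 0 gets 100, 1 gets 200
df['by_index'] = s

# Array assignment ignores labels and fills rows top to bottom
df['by_position'] = s.to_numpy()

print(df)
```

This is why keeping the index (via reset_index on the grouped result) is the safer choice whenever the row order of the result may differ from the row order of the frame.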
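EDIT: General solution:
def f(x):
    # N = 1 sums the current day plus the one before it (a two-day window)
    N = 1
    for i in pd.unique(x['day']):
        # rows of this customer whose day lies in [i - N, i]
        y = x[x['day'].between(i - N, i)]
        # write the window total onto the last row of that window
        x.loc[y.index[-1], 'amountNdays'] = y['amount'].sum()
    return x

# group_keys=False avoids adding the customer key as an extra index level
# (the default behaviour in pandas >= 2.0)
df = df.groupby('customer', group_keys=False).apply(f)
# rows that did not receive a window total fall back to their own amount
df['amountNdays'] = df['amountNdays'].fillna(df['amount'])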
print (df)
day customer amount amountNdays
0 1 cust1 500 500.0
1 2 cust2 100 100.0
2 1 cust1 50 550.0
3 2 cust1 100 650.0
4 2 cust2 250 350.0
5 6 cust1 20 20.0
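For comparison, here is a minimal sketch that avoids groupby.apply, whose behaviour has changed between pandas versions. It assumes the expected output in the question is meant as a running total: each row sums itself and the earlier rows of the same customer whose day falls in the last two days. The helper name trailing_day_sum and its n_days parameter are hypothetical, not part of pandas or of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'day': [1, 2, 1, 2, 2, 6],
                   'customer': ['cust1', 'cust2', 'cust1', 'cust1', 'cust2', 'cust1'],
                   'amount': [500, 100, 50, 100, 250, 20]})


def trailing_day_sum(group, n_days=2):
    """Hypothetical helper: for each row, sum this row and the earlier rows
    of the group whose day lies in (day - n_days, day]."""
    days = group['day'].to_numpy()
    amounts = group['amount'].to_numpy()
    totals = []
    for pos in range(len(group)):
        seen_days = days[:pos + 1]          # rows up to and including this one
        in_window = (seen_days > days[pos] - n_days) & (seen_days <= days[pos])
        totals.append(amounts[:pos + 1][in_window].sum())
    # keep the group's index so the result aligns with df on assignment
    return pd.Series(totals, index=group.index)


# Compute per customer, then let pandas align the pieces on the index
parts = [trailing_day_sum(g) for _, g in df.groupby('customer')]
df['amount2days'] = pd.concat(parts)

print(df['amount2days'].tolist())  # [500, 100, 550, 650, 350, 20]
```

The explicit loop is quadratic in the size of each group, so it suits small groups. One difference from f above: when a customer has three or more rows on the same day, this sketch gives every row a running total, while f writes the window total only onto the last row for that day.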
Answer 1 (score: 1)
You can use pandas' rolling for moving-window operations (depending on your pandas version, reset_index as in jezrael's answer is safer):
df['amount2days'] = (df.groupby('customer', sort=False).amount
.rolling(2, min_periods=0)
.sum()
.to_numpy())
print(df)
day customer amount amount2days
1 1 cust1 500 500.0
2 2 cust1 100 600.0
3 3 cust1 250 350.0
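One caveat worth checking: rolling(2) uses a window of two rows, not two days, and the printed output above appears to come from a three-row, single-customer sample rather than from the DataFrame in the question. A minimal sketch, assuming the question's DataFrame, shows where the row-based window departs from the expected amount2days column:

```python
import pandas as pd

df = pd.DataFrame({'day': [1, 2, 1, 2, 2, 6],
                   'customer': ['cust1', 'cust2', 'cust1', 'cust1', 'cust2', 'cust1'],
                   'amount': [500, 100, 50, 100, 250, 20]})

# Window of 2 ROWS per customer: the current row plus the previous one
df['amount2days'] = (df.groupby('customer')['amount']
                       .rolling(2, min_periods=0)
                       .sum()
                       .reset_index(level=0, drop=True))

print(df['amount2days'].tolist())
# [500.0, 100.0, 550.0, 150.0, 350.0, 120.0]
# The question expects 650 for row 3 and 20 for row 5.
```

Rows 3 and 5 differ because the window ignores the day column entirely. A window based on day values, such as the general solution in the first answer, is needed when a customer can have several rows per day or gaps between days.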