Below is my dataframe
Txn_Key Send_Agent Send_Time Pay_Time Send_Amount \
0 NaN ANO080012 2012-05-31 02:25:00 2012-05-31 21:43:00 490.00
1 NaN AUK359401 2012-05-31 11:25:00 2012-05-31 11:57:00 616.16
2 NaN ACL000105 2012-05-31 13:07:00 2012-05-31 17:36:00 193.78
3 NaN AED420319 2012-05-31 10:50:00 2012-05-31 11:34:00 999.43
4 NaN ARA030210 2012-05-30 12:14:00 2012-05-31 04:16:00 433.29
5 NaN AJ5020114 2012-05-31 02:37:00 2012-05-31 04:31:00 378.00
6 NaN A11171047 2012-05-31 09:39:00 2012-05-31 10:08:00 865.34
Pay_Amount MTCN Send_Phone Refund_Flag time_diff
0 475.68 9323625903 97549829 NaN 0 days 19:18:00
1 600.87 3545067820 440000000000 NaN 0 days 00:32:00
2 185.21 1453132764 0511 NaN 0 days 04:29:00
3 963.04 4509062067 971566016900 NaN 0 days 00:44:00
4 423.75 6898279087 144 NaN 0 days 16:02:00
5 377.99 5170985243 963954932506 NaN 0 days 01:54:00
6 833.89 5352719100 0644798854 NaN 0 days 00:29:00
So I need to count the rows where Send_Amount is the same as in the previous row. A groupby with a lambda works perfectly fine:
txn1 = txns.loc[:,['Send_Agent','Send_Amount']]
Send_repeat_count = txn1.groupby('Send_Agent').apply(lambda txn1: (txn1.Send_Amount.shift() == txn1.Send_Amount).cumsum())
But a similar lambda function does not work in groupby.agg:
grouped=txn.groupby('Send_Agent')
x=grouped.agg({'Send_Amount':'mean','Pay_Amount':'mean','time_diff':'min','MTCN':'size','Send_Phone':'nunique','Refund_Flag':'count','Send_Amount':lambda txn1 : (txn1.Send_Amount.shift() == txn1.Send_Amount).cumsum()})
AttributeError: 'Series' object has no attribute 'Send_Amount'
So I wrote a separate function to do the same thing and called it in my groupby.agg:
def repeat_count(x):
    if x==x.shift():
        cumsum()
x = grouped.agg({'Send_Amount':'mean','Pay_Amount':'mean','time_diff':'min','MTCN':'size','Send_Phone':'nunique','Refund_Flag':'count','Send_Amount':repeat_count(x)})
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
If cumsum works fine in groupby.apply, why doesn't it work inside the function?
Answer 0 (score: 1)
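Typically, the Send_Agent column will contain duplicates (otherwise grouping by Send_Agent would be pointless). Moreover, (x==x.shift()).cumsum() returns a Series with as many rows as there are rows in each Send_Agent group.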
df.groupby(...).agg(func) expects func to return a scalar (such as a float). func is not allowed to return a Series. (In contrast, func may return a Series or even a DataFrame when you use df.groupby(...).apply(func).)
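A minimal sketch of that difference, using a small made-up frame (the names df, agent and amount are illustrative and do not come from the question):

```python
import pandas as pd

# Toy data; the column names are hypothetical stand-ins for Send_Agent / Send_Amount.
df = pd.DataFrame({'agent': ['a', 'a', 'a', 'b'],
                   'amount': [10.0, 10.0, 10.0, 5.0]})
grouped = df.groupby('agent')['amount']

# agg: the function returns one scalar per group, so this is accepted.
per_group = grouped.agg(lambda s: (s.shift() == s).sum())

# apply: the function may return a whole Series per group.
running = grouped.apply(lambda s: (s.shift() == s).cumsum())

print(per_group)
print(running)
```

Here per_group has one value per agent, while running has one value per original row.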
If you want to count the number of adjacent rows in a group that are equal, you could use sum() instead of cumsum(). For example,
import numpy as np
import pandas as pd
pd.options.display.width = 1000
nan = np.nan
txn = pd.DataFrame(
{'MTCN': [0, 9323625903, 3545067820, 1453132764, 4509062067, 6898279087, 5170985243, 5352719100],
'Pay_Amount': [1, 475.68, 600.87, 185.21, 963.04, 423.75, 377.99, 833.89],
'Pay_Time': ['2012-05-31 10:08:00', '2012-05-31 21:43:00', '2012-05-31 11:57:00', '2012-05-31 17:36:00',
'2012-05-31 11:34:00', '2012-05-31 04:16:00', '2012-05-31 04:31:00',
'2012-05-31 10:08:00'],
'Refund_Flag': [nan, nan, nan, nan, nan, nan, nan, nan],
'Send_Amount': [865.34, 490.0, 616.16, 193.78, 999.43, 433.29, 378.0, 865.34],
'Send_Phone': [3, 97549829, 440000000000, 511, 971566016900, 144, 963954932506, 644798854],
'Send_Time': ['2012-05-31 09:39:00', '2012-05-31 02:25:00', '2012-05-31 11:25:00', '2012-05-31 13:07:00',
'2012-05-31 10:50:00', '2012-05-30 12:14:00', '2012-05-31 02:37:00',
'2012-05-31 09:39:00'],
'Txn_Key': [nan, nan, nan, nan, nan, nan, nan, nan],
'Send_Agent': ['A11171047', 'ANO080012', 'AUK359401', 'ACL000105', 'AED420319',
'ARA030210', 'AJ5020114', 'A11171047'],
'time_diff': ['0 days 00:29:00', '0 days 19:18:00', '0 days 00:32:00', '0 days 04:29:00',
'0 days 00:44:00', '0 days 16:02:00', '0 days 01:54:00',
'0 days 00:29:00', ]} )
txn['time_diff'] = pd.to_timedelta(txn['time_diff'])
grouped = txn.groupby('Send_Agent')
def repeat_count(s):
    # Count the rows whose value equals the value in the previous row.
    return (s.shift() == s).sum()
result = grouped.agg(
{'Pay_Amount':'mean',
'time_diff':'min',
'MTCN':'size',
'Send_Phone':'nunique',
'Refund_Flag':'count',
'Send_Amount': ['mean', repeat_count]})
print(result)
yields
Refund_Flag time_diff Send_Phone MTCN Send_Amount Pay_Amount
count min nunique size mean repeat_count mean
Send_Agent
A11171047 0 1740000000000 2 2 865.34 1.0 417.445
ACL000105 0 16140000000000 1 1 193.78 0.0 185.210
AED420319 0 2640000000000 1 1 999.43 0.0 963.040
AJ5020114 0 6840000000000 1 1 378.00 0.0 377.990
ANO080012 0 69480000000000 1 1 490.00 0.0 475.680
ARA030210 0 57720000000000 1 1 433.29 0.0 423.750
AUK359401 0 1920000000000 1 1 616.16 0.0 600.870
(I added a row so that repeat_count does not always return 0.)
When you use DataFrame.groupby(...).apply(func), the object passed to func is a DataFrame. Thus,
txn1.groupby('Send_Agent').apply(
lambda txn1 : (txn1.Send_Amount.shift() == txn1.Send_Amount).cumsum())
works because txn1 inside the lambda is a DataFrame with a Send_Amount column.
In contrast, when you use DataFrame.groupby(...).agg({'col': func}), the object passed to func is a Series whose values come from the column named by col. Thus
x = grouped.agg({'Send_Amount':'mean','Pay_Amount':'mean','time_diff':'min','MTCN':'size','Send_Phone':'nunique','Refund_Flag':'count','Send_Amount':lambda txn1 : (txn1.Send_Amount.shift() == txn1.Send_Amount).cumsum()})
raises AttributeError: 'Series' object has no attribute 'Send_Amount' because the Series passed to the lambda function (and bound to the variable txn1) has no Send_Amount attribute.
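One way to see this for yourself is to record the type of the object each function receives. The following is only a sketch on made-up data; df, record_apply and record_agg are hypothetical names introduced for illustration:

```python
import pandas as pd

# Toy data; column names are illustrative only.
df = pd.DataFrame({'agent': ['a', 'a', 'b'], 'amount': [1.0, 1.0, 2.0]})
seen = {'apply': set(), 'agg': set()}

def record_apply(group):
    # Called once per group by apply; remember what kind of object arrived.
    seen['apply'].add(type(group).__name__)
    return len(group)

def record_agg(col):
    # Called once per group by agg, for the 'amount' column only.
    seen['agg'].add(type(col).__name__)
    return len(col)

df.groupby('agent')[['amount']].apply(record_apply)
df.groupby('agent').agg({'amount': record_agg})
print(seen)
```

The apply function sees a DataFrame, so attribute access such as group.amount works; the agg function sees only the selected column as a Series.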
If you use repeat_count:
def repeat_count(x):
    if x==x.shift():
        return x.cumsum()
then if x==x.shift() raises
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
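because x==x.shift() is a Series (a DataFrame in the traceback above, which behaves the same way here), and if expression causes expression to be evaluated in a boolean context. That is, expression.__bool__() is called. __bool__ must return True or False or raise an exception. So for if x==x.shift() to make sense, (x==x.shift()).__bool__() must return True or False.
Series.__bool__() always raises a ValueError like the one above, because Pandas (by design) refuses to guess whether it should return True when all of the values in the Series are True, or when any of the values is True, or when the Series is merely non-empty, and so on. The ValueError message points in the right direction. Typically, the problem is resolved by being explicit about which boolean you want, by calling (x==x.shift()).any() or (x==x.shift()).all(), etc.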
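A small sketch of the ambiguity and of the explicit alternatives, on a made-up Series (the names s and mask are illustrative):

```python
import pandas as pd

s = pd.Series([5.0, 5.0, 7.0])
mask = s.shift() == s   # element-wise comparison: [False, True, False]

try:
    if mask:            # asks pandas for a single truth value
        pass
except ValueError as err:
    print('ValueError:', err)

print(mask.any())   # True if at least one row equals the previous row
print(mask.all())   # True only if every row equals the previous row
print(mask.sum())   # number of rows that equal the previous row
```

For counting repeats, sum() is the reduction that matches the question, since it turns the boolean mask into a single number per group.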
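A note on performance: in general, groupby/agg with a custom function is not as fast as groupby/agg with built-in methods such as count or sum. It is therefore usually worth finding a way (if possible) to express the calculation in terms of built-in methods. In this case, you can do a preliminary calculation on the entire DataFrame, which then allows you to use groupby/agg/sum: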
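# A stable sort keeps the original row order within each Send_Agent.
txn = txn.sort_values(by='Send_Agent', kind='mergesort')
# A row is a repeat if it has the same agent and the same amount as the row before it.
txn['repeat'] = ((txn['Send_Agent'].shift() == txn['Send_Agent'])
                 & (txn['Send_Amount'].shift() == txn['Send_Amount']))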
grouped = txn.groupby('Send_Agent')
result = grouped.agg(
{'Pay_Amount':'mean',
'time_diff':'min',
'MTCN':'size',
'Send_Phone':'nunique',
'Refund_Flag':'count',
'Send_Amount': 'mean',
'repeat':'sum'})
print(result)
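As a sanity check, the sketch below compares the custom-function count with a vectorized count on a small made-up frame. The data and the names slow, fast and ordered are illustrative, and the stable sort (kind='mergesort') is an assumption made here so that rows keep their original order within each agent:

```python
import pandas as pd

# Toy data: agent 'a' sends 10.0 three times in a row, agent 'b' never repeats.
df = pd.DataFrame({'Send_Agent': ['b', 'a', 'a', 'b', 'a'],
                   'Send_Amount': [5.0, 10.0, 10.0, 6.0, 10.0]})

# Custom-function version: count rows equal to the previous row within each group.
slow = df.groupby('Send_Agent')['Send_Amount'].agg(lambda s: (s.shift() == s).sum())

# Vectorized version: sort so each agent's rows are contiguous, then compare neighbours.
ordered = df.sort_values(by='Send_Agent', kind='mergesort')
same_agent = ordered['Send_Agent'].shift() == ordered['Send_Agent']
same_amount = ordered['Send_Amount'].shift() == ordered['Send_Amount']
fast = (same_agent & same_amount).groupby(ordered['Send_Agent']).sum()

print(slow.tolist(), fast.tolist())
```

The same_agent condition matters at group boundaries: without it, the first row of one agent could be counted as a repeat of the last row of the previous agent.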