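I need to process daily data for bank customers. Sample data for two customers is loaded into the df DataFrame.
I need to process each customer separately and iterate day by day in order to roll balances forward and retain data, so I create an index on account_id + bus_dt and sort df by that index. I need to compute the date and balance differences between rows, so I need the previous row's values.
For the first row of each account I need to reset all values to specific values, so I use the cumcount() function to create a sequence number within each group.
I am able to update all rows in df where seq = 0, but I cannot update the rows that require seq > 0 plus another condition. I can access the selected rows using the first part of the where condition: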
df.loc[df['seq'] > 0 , 'balance']
I can also access the selected rows using the second part of the where condition:
df.loc[df['bus_dt'] - df['prev_bus_dt'] <= pd.Timedelta('2 days') , 'balance']
But I cannot access the required rows using both criteria at once:
df.loc[(df['seq'] > 0 and df['bus_dt'] - df['prev_bus_dt'] <= pd.Timedelta('2 days')) , 'balance']
This raises an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Ludwik\Python\python-3.5.4rc1-embed-amd64\lib\site-packages\pandas\core\generic.py", line 955, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How can I select rows where A > 0 and B = whatever?
Here is the complete code that prepares the data:
import pandas as pd
import numpy as np
# np.array([3] * 4 + [4] * 5 ,dtype='int32')
# dates = np.array(pd.date_range('20170101', periods=6) + pd.date_range('20170101', periods=5) )
df1=pd.DataFrame({
'account_id': np.array([101] * 7,dtype='int32'),
'bus_dt': pd.date_range('20170101', periods=7),
'balance': abs(np.random.randn(7)*100)
})
df2=pd.DataFrame({
'account_id': np.array([102] * 10,dtype='int32'),
'bus_dt': pd.date_range('20170104', periods=10),
'balance': abs(np.random.randn(10)*100)
})
df1=df1.loc[df1['bus_dt'] != '20170103']
df1=df1.loc[df1['bus_dt'] != '20170104']
df2=df2.loc[df2['bus_dt'] != '20170111']
df2=df2.loc[df2['bus_dt'] != '20170112']
df=pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0
df.head()
# I need to process each account separately and iterate day after day, rolling and retaining data,
# so I create an index on account_id and bus_dt and sort df by this index
df.set_index(['account_id','bus_dt'], inplace=True, drop=False)
df.sort_index(ascending=[True,True], inplace=True)
# I need to calculate date differences between rows, so I need the previous row's values
df['prev_bus_dt']=df.groupby(level=0)['bus_dt'].shift(1)
df['prev_balance']=df.groupby(level=0)['balance'].shift(1)
# I need to reset the first row in each group, so I create a sequence number within each group to find the row with seq == 0
df['seq']=df.groupby(level=0).cumcount()
# so I update the first row of each group
df.loc[df['seq'] == 0, 'prev_bus_dt'] = df['bus_dt']
df.loc[df['seq'] == 0, 'prev_balance'] = df['balance']
Here is the part I am struggling with. How can I perform an update on all DataFrame rows that match the criteria below?
# but when I need to update selected column based on complex where criteria, here starts the problem:
# all of the below methods do not work
# option 1
df.loc[df['seq'] > 0 and df['bus_dt'] - df['prev_bus_dt'] <= pd.Timedelta('2 days'), 'balance']=max(df['prev_balance'] - df['balance'],0)
# option 2
df['balance']=np.where(df['seq'] > 0 and df['bus_dt'] - df['prev_bus_dt'] <= pd.Timedelta('2 days'), max(df['prev_balance'] - df['balance'],0), 0)
# option 3
df.loc[df['seq'] > 0 and df['bus_dt'] - df['prev_bus_dt'] <= pd.Timedelta('2 days'),'balance'].all()=max(df['prev_balance'] - df['balance'],0)
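I am new to Python and am trying to replicate logic that was implemented in SAS; I hope what I have written here is equivalent. All of the operations above can be performed on the whole DataFrame "at once" and serve to prepare the data for row-by-row iteration, so I am open to suggestions about anything that is inefficient or incorrectly implemented.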
Answer 0: (score: 1)
You need to use the & operator instead of and:
df.loc[(df['seq'] > 0) & (df['bus_dt'] - df['prev_bus_dt'] <= pd.Timedelta('2 days')) , 'balance']
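To illustrate the difference, here is a minimal sketch on hypothetical toy data (the gap_days column and the values are made up for illustration and are not from the question). It shows that and fails because Python asks the whole Series for a single True/False, while & combines the two conditions row by row:

```python
import pandas as pd

# Hypothetical toy data, not the asker's frame.
df = pd.DataFrame({
    "seq": [0, 1, 2, 3],
    "gap_days": [0, 1, 3, 2],
    "balance": [10.0, 20.0, 30.0, 40.0],
})

# `and` needs one True/False from the left-hand Series, which raises ValueError.
try:
    df["seq"] > 0 and df["gap_days"] <= 2
    and_failed = False
except ValueError:
    and_failed = True

# `&` combines the two boolean Series element by element.
mask = (df["seq"] > 0) & (df["gap_days"] <= 2)
selected = df.loc[mask, "balance"]
print(and_failed, selected.tolist())
```

The same pattern applies to | for OR and ~ for NOT.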
Answer 1: (score: 1)
Use & to AND predicates together in a DataFrame selection:
df.loc[((df['seq'] > 0) & (df['bus_dt'] - df['prev_bus_dt'] <= pd.Timedelta('2 days'))) , 'balance']
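Also note that you must put () around each predicate, because & binds more tightly than comparison operators such as > and <=.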
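For the update itself, note that the built-in max(series, 0) used in the question's options hits the same ambiguity error, since it also tries to reduce a Series to a single True/False. One possible workaround, sketched here on hypothetical toy data shaped like the question's frame after the shift() step, is to take the element-wise maximum with Series.clip(lower=0) and assign it only to the masked rows. This is an assumption about the intended logic, not a confirmed translation of the SAS code:

```python
import pandas as pd

# Hypothetical toy data; column names follow the question, values are made up.
df = pd.DataFrame({
    "seq": [0, 1, 2, 3],
    "bus_dt": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-05", "2017-01-06"]),
    "prev_bus_dt": pd.to_datetime(["2017-01-01", "2017-01-01", "2017-01-02", "2017-01-05"]),
    "balance": [100.0, 80.0, 50.0, 70.0],
    "prev_balance": [100.0, 100.0, 80.0, 50.0],
})

# Each predicate is parenthesised before being combined with &.
mask = (df["seq"] > 0) & (df["bus_dt"] - df["prev_bus_dt"] <= pd.Timedelta("2 days"))

# Element-wise equivalent of max(prev_balance - balance, 0).
drop = (df["prev_balance"] - df["balance"]).clip(lower=0)

# Only rows matching both conditions are overwritten; the rest keep their balance.
df.loc[mask, "balance"] = drop[mask]
print(df["balance"].tolist())
```

Rows that fail either condition are left untouched, which differs from the np.where attempt in option 2, where they would be set to 0.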