"Iterative" window function on subsets of a data frame

Time: 2018-10-24 16:01:40

Tags: python pandas function window subset

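I am looking for a way to create the column 'min_value' from the data frame df below. For each row i, we take from the entire data frame all records that belong to the same ['Date_A', 'Date_B'] group as row i and whose 'Advance' is smaller than the 'Advance' of row i. From that subset we then pick the minimum of the 'Amount' column and set it as 'min_value' for row i.

Initial data frame: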

import pandas as pd

dates_A = ['2017-12-25','2017-12-25','2017-12-25','2018-1-25','2018-1-25','2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-2-1','2018-2-1','2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df = pd.DataFrame({'Date_A':Date_A,
       'Date_B':Date_B,        
       'Advance' : [10,103,200,5,8,150],
       'Amount' : [180,220,200,230,220,240]})

df  = df [['Date_A', 'Date_B', 'Advance', 'Amount']]
df 

Desired output:

dates_A = ['2017-12-25','2017-12-25','2017-12-25','2018-1-25','2018-1-25','2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-2-1','2018-2-1','2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A':Date_A,
       'Date_B':Date_B,        
       'Advance' : [10,103,200,5,8,150],
       'Amount' : [180,220,200,230,220,240],
       'min_value': [180,180,180,230,230,220] })

df_out  = df_out [['Date_A', 'Date_B', 'Advance', 'Amount','min_value']]
df_out 

I wrote the loop below, which I think does the job, but it takes far too long to run. I assume there must be a more efficient way to do this.

for i in range(len(df)):
    date1 = df['Date_A'][i]     # Date_A of row i
    date2 = df['Date_B'][i]     # Date_B of row i
    advance = df['Advance'][i]  # Advance of row i
    # rows in the same (Date_A, Date_B) group with a smaller Advance than row i
    mask = (df['Date_A'] == date1) & (df['Date_B'] == date2) & (df['Advance'] < advance)
    df.loc[i, 'min_value'] = df.loc[mask, 'Amount'].min()
# for the smallest Advance in each group, set min_value to the row's own Amount
df['min_value'] = df['min_value'].fillna(df['Amount'])
df

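I hope this is clear enough. Thank you for your help.

Improved question: Thank you very much for your answer. For the last part, the NA rows, I would like to replace the row's Amount with the overall Amount of the Date_A, Date_B, Advance group, so that I get the overall minimum Amount of the last day before Date_A.

Improved desired output (two records with the smallest Advance value):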

dates_A = ['2017-12-25','2017-12-25','2017-12-25','2017-12-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-1-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]

df_out = pd.DataFrame({'Date_A':Date_A,
       'Date_B':Date_B,        
       'Advance' : [5,8,150,5],
       'Amount' : [230,220,240,225],
       'min_value': [225,230,220,225] })

df_out  = df_out [['Date_A', 'Date_B', 'Advance', 'Amount','min_value']]
df_out 

Thanks
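
One possible reading of the updated requirement, sketched under assumptions: first reduce each (Date_A, Date_B, Advance) combination to its smallest Amount, then take the running minimum over strictly smaller Advance values, and let rows at the smallest Advance fall back to the minimum of their own Advance level. The helper name min_before_with_ties is hypothetical, and the dates are built with pd.to_datetime for brevity. Under this strict reading the Advance 8 row comes out as 225 rather than the 230 listed in the desired output above, so treat it as a starting point rather than a confirmed solution.

```python
import pandas as pd

def min_before_with_ties(df):
    """Hypothetical helper: per-row minimum Amount over smaller Advance values."""
    keys = ['Date_A', 'Date_B']
    # Smallest Amount per (Date_A, Date_B, Advance); groupby sorts its keys,
    # so Advance is ascending within each date pair.
    level_min = df.groupby(keys + ['Advance'], as_index=False)['Amount'].min()
    # Running minimum over strictly smaller Advance levels in the same date pair.
    level_min['min_value'] = (level_min.groupby(keys)['Amount']
                                       .transform(lambda s: s.cummin().shift()))
    # Smallest Advance level: fall back to that level's own minimum Amount.
    level_min['min_value'] = level_min['min_value'].fillna(level_min['Amount'])
    # Attach the per-level result to every original row (left merge keeps row order).
    return df.merge(level_min[keys + ['Advance', 'min_value']],
                    on=keys + ['Advance'], how='left')

df = pd.DataFrame({
    'Date_A': pd.to_datetime(['2017-12-25'] * 4),
    'Date_B': pd.to_datetime(['2018-01-01'] * 4),
    'Advance': [5, 8, 150, 5],
    'Amount': [230, 220, 240, 225],
})
out = min_before_with_ties(df)
print(out)
```

Aggregating per Advance level first keeps tied rows from seeing each other's Amount as a "smaller Advance" value, and the merge hands both tied rows the same min_value.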

1 answer:

Answer 0: (score: 1)

You can use groupby on 'Date_A' and 'Date_B' after sorting the values by 'Advance', and apply the functions cummin and shift to the 'Amount' column. Then use fillna with the values from the 'Amount' column, for example:

df['min_value'] = (df.sort_values('Advance').groupby(['Date_A','Date_B'])['Amount']
                      .apply(lambda ser_g: ser_g.cummin().shift()).fillna(df['Amount']))
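
A variant of the same idea, sketched here with transform instead of apply. This is an assumption-driven alternative rather than the answer's own code: on recent pandas versions, groupby(...).apply may prepend the group keys to the index of its result, in which case the assignment back to df would no longer line up, whereas transform returns a result indexed like its input. The dates are built with pd.to_datetime for brevity instead of the question's list comprehension.

```python
import pandas as pd

df = pd.DataFrame({
    'Date_A': pd.to_datetime(['2017-12-25'] * 3 + ['2018-01-25'] * 3),
    'Date_B': pd.to_datetime(['2018-01-01'] * 3 + ['2018-02-01'] * 3),
    'Advance': [10, 103, 200, 5, 8, 150],
    'Amount': [180, 220, 200, 230, 220, 240],
})

# Within each (Date_A, Date_B) group, ordered by Advance: running minimum of
# Amount, shifted one row down so each row only sees rows that come before it.
prev_min = (df.sort_values('Advance')
              .groupby(['Date_A', 'Date_B'])['Amount']
              .transform(lambda s: s.cummin().shift()))

# The first row of each group has no predecessor; use its own Amount there.
# fillna and the column assignment both align on the original index labels.
df['min_value'] = prev_min.fillna(df['Amount'])
print(df)
```

On the question's sample data this gives 180, 180, 180, 230, 230, 220 for min_value, matching the desired output. Like the original one-liner, it relies on sort order alone, so rows that share the same Advance within a group are not treated specially.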