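I am using pandas (0.25.3) and Python (3.7.4). I'm working with a DataFrame similar to df1 below. I need to conditionally move the values of the "Hours" and "Wages" fields into the "Gross Hours", "Regular Wages", and "Overtime Wages" fields, based on the value of the "Pay Code" field in the same DataFrame. I also need to group by "Check Date".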
import pandas as pd

df1 = pd.DataFrame( {
    "Pay Code" : ["1","4","OCH","3","3"],
    "Check Date" : ["2019-01-04","2019-01-04","2019-01-04","2019-01-04","2019-01-18"],
    "Pay Start Date" : ["2018-12-15","2018-12-15","2018-12-15","2018-12-15","2018-12-29"],
    "Pay End Date" : ["2018-12-28","2018-12-28","2018-12-28","2018-12-28","2019-01-11"],
    "Pay Code Description" : ["REGULAR PAY","HOLIDAY PAY","ON CALL HOURLY","VACATION PAY","VACATION PAY"],
    "Hours" : [46.0,16.0,152.0,18.0,19.5],
    "Wages" : [1226.58,426.64,63.33,479.98,530.38],
    # The three target columns start out as "NaN" placeholder strings
    "Gross Hours" : ["NaN","NaN","NaN","NaN","NaN"],
    "Regular Wages" : ["NaN","NaN","NaN","NaN","NaN"],
    "Overtime Wages" : ["NaN","NaN","NaN","NaN","NaN"]
} )
Let's say I have static lists as a reference that determine which column each value should go to.
GrossHours = ['1','2','3']
RegularWages = ['1','3','4']
OvertimeWages = ['2','OCH']
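To make the role of these lists concrete, here is a minimal sketch of turning them into per-row boolean masks with `Series.isin`. The names `code_map` and `flags` are illustrative only and do not appear in the original post; the pay codes are copied from df1.

```python
import pandas as pd

# Pay codes copied from the question's df1
pay_codes = pd.Series(["1", "4", "OCH", "3", "3"], name="Pay Code")

# Hypothetical mapping: target column -> pay codes that feed it
code_map = {
    "Gross Hours": ["1", "2", "3"],
    "Regular Wages": ["1", "3", "4"],
    "Overtime Wages": ["2", "OCH"],
}

# One boolean column per target, True where the row's pay code qualifies
flags = pd.DataFrame(
    {target: pay_codes.isin(codes) for target, codes in code_map.items()}
)
print(flags)
```

A row can qualify for more than one target: pay code "1" feeds both "Gross Hours" and "Regular Wages".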
The desired result would be this DataFrame:
df_result = pd.DataFrame( {
    "Check Date" : ["2019-01-04","2019-01-18"],
    "Pay Start Date" : ["2018-12-15","2018-12-29"],
    "Pay End Date" : ["2018-12-28","2019-01-11"],
    "Hours" : [232,19.5],
    "Wages" : [2196.53,530.38],
    "Gross Hours" : [64.0,19.5],
    "Regular Wages" : [2133.2,530.38],
    "Overtime Wages" : [63.33,"NaN"]
} )
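What am I missing? I have tried applying a number of lambda functions to df1 that give the results I need, but I'm not sure how to get those result objects cleanly back into the original DataFrame df1. Is my only option to build a bunch of intermediate DataFrames, join or merge them back to the original, and then group by again?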
g1 = df1.groupby(["Check Date"])
g1.apply(lambda x: x[x['Pay Code'].isin(GrossHours)]['Hours'].astype(float).sum())
Check Date
2019-01-04    64.0
2019-01-18    19.5
dtype: float64
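Because each of these per-group results is a Series indexed by "Check Date", several of them can be assembled side by side without explicit merges. The following is only a sketch under that assumption: the helper `conditional_sum` and the dict `parts` are hypothetical names, and the frame is a trimmed copy of df1.

```python
import pandas as pd

# Trimmed copy of the question's df1 (only the columns used here)
df = pd.DataFrame({
    "Pay Code": ["1", "4", "OCH", "3", "3"],
    "Check Date": ["2019-01-04"] * 4 + ["2019-01-18"],
    "Hours": [46.0, 16.0, 152.0, 18.0, 19.5],
    "Wages": [1226.58, 426.64, 63.33, 479.98, 530.38],
})

def conditional_sum(frame, codes, source):
    """Sum `source` per Check Date over rows whose Pay Code is in `codes`."""
    return frame[frame["Pay Code"].isin(codes)].groupby("Check Date")[source].sum()

parts = {
    "Gross Hours": conditional_sum(df, ["1", "2", "3"], "Hours"),
    "Regular Wages": conditional_sum(df, ["1", "3", "4"], "Wages"),
    "Overtime Wages": conditional_sum(df, ["2", "OCH"], "Wages"),
}

# Building a DataFrame from a dict of Series aligns them on the union of
# their indexes; dates with no qualifying rows come out as NaN
combined = pd.DataFrame(parts)
print(combined)
```

The result is indexed by "Check Date", so it could then be joined to any other frame that shares that key.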
Answer 0 (score: 0)
First, I built a list of tuples to iterate over.
transformations = [('Gross_Hours', ['1','2','3']), ('Regular_Wages', ['1','3','4']), ('Overtime_Wages', ['2','OCH'])]
I also defined the structure of the output DataFrame I expect.
result_dataframe_fields = ['Check Date', 'Pay Start Date','Pay End Date','Gross Hours', 'Regular Wages', 'Overtime Wages']
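By applying @Datanovice's suggestion to a path similar to the one I was already on, I ended up with the following, which is about as clear and readable as I could make it.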
# Instantiate result dataframe
df_result = df1.groupby(result_dataframe_fields).sum().reset_index()
for t_ix, t_list in transformations:
    # Create aggregated set to populate result dataframe
    if t_ix == 'Gross_Hours':
        g1 = df1.loc[df1['Pay Code'].isin(t_list)].groupby('Check Date')['Hours'].agg(temp_col_name='sum')
        g2 = g1.reset_index()
        g2.columns = ['Check Date', t_ix]
    else:
        g1 = df1.loc[df1['Pay Code'].isin(t_list)].groupby('Check Date')['Wages'].agg(temp_col_name='sum')
        g2 = g1.reset_index()
        g2.columns = ['Check Date', t_ix]
    # Handle the .agg() column naming limitation (no spaces on list agg)
    colsg2 = g2.columns
    colsg2 = colsg2.map(lambda x: x.replace('_', ' ') if isinstance(x, str) else x)
    g2.columns = colsg2
    # Dataframe copy that will update result dataframe
    update_df = g2.copy()
    # update() aligns on the row index, so g2's rows must line up with df_result's
    df_result.update(update_df)
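One caveat worth illustrating: `DataFrame.update` aligns rows on the index, not on "Check Date". The sketch below uses made-up values (the 99.0 figure and the frame names `target`, `partial`, `by_position`, and `by_key` are hypothetical) to show what happens when a partial aggregate has been through `reset_index()`, and how setting the key as the index on both sides avoids the problem.

```python
import pandas as pd

target = pd.DataFrame({
    "Check Date": ["2019-01-04", "2019-01-18"],
    "Overtime Wages": [float("nan"), float("nan")],
})
# Hypothetical partial aggregate: only the later date has a value, but after
# reset_index() that row carries index label 0
partial = pd.DataFrame({"Check Date": ["2019-01-18"], "Overtime Wages": [99.0]})

# Aligning on the default integer index: row 0 of `partial` overwrites
# row 0 of the target, including its "Check Date"
by_position = target.copy()
by_position.update(partial)

# Aligning on the key instead: index both frames by "Check Date" first
by_key = target.set_index("Check Date")
by_key.update(partial.set_index("Check Date"))
by_key = by_key.reset_index()

print(by_position)
print(by_key)
```

With the sample df1 every aggregate happens to start at the first check date, so the loop above gives the expected numbers, but that is a property of the data rather than of the code.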
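I'm still hoping this isn't the best answer, because my actual application is much larger than this, and the approach looks fairly horrific once it is scaled up to my "real code".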
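One possible way to shrink the loop, offered as a sketch rather than a definitive solution: mask each source column with `Series.where` so that only qualifying rows keep their value, then run a single `groupby(...).sum(min_count=1)`. The `spec` dict and the names `masked`, `keys`, and `result` are assumptions introduced here, not part of the original answer, and the target columns are built fresh instead of starting from "NaN" strings.

```python
import pandas as pd

df = pd.DataFrame({
    "Pay Code": ["1", "4", "OCH", "3", "3"],
    "Check Date": ["2019-01-04"] * 4 + ["2019-01-18"],
    "Pay Start Date": ["2018-12-15"] * 4 + ["2018-12-29"],
    "Pay End Date": ["2018-12-28"] * 4 + ["2019-01-11"],
    "Hours": [46.0, 16.0, 152.0, 18.0, 19.5],
    "Wages": [1226.58, 426.64, 63.33, 479.98, 530.38],
})

# Hypothetical spec: target column -> (source column, qualifying pay codes)
spec = {
    "Gross Hours": ("Hours", ["1", "2", "3"]),
    "Regular Wages": ("Wages", ["1", "3", "4"]),
    "Overtime Wages": ("Wages", ["2", "OCH"]),
}

keys = ["Check Date", "Pay Start Date", "Pay End Date"]
masked = df[keys + ["Hours", "Wages"]].copy()
for target, (source, codes) in spec.items():
    # Keep the source value on qualifying rows, NaN everywhere else
    masked[target] = df[source].where(df["Pay Code"].isin(codes))

# min_count=1 leaves a group as NaN when no row contributed a value,
# instead of reporting a sum of 0
result = masked.groupby(keys).sum(min_count=1).reset_index()
print(result)
```

Adding another target column then only means adding one entry to `spec`, which may scale better than one branch per column.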