What is the cleanest way to group and conditionally transform data in a pandas DataFrame?

Time: 2019-11-19 18:37:38

Tags: python pandas dataframe

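I am using pandas (0.25.3) and Python (3.7.4). I am working with a DataFrame similar to df1 below. Based on the value of the "Pay Code" field in the same DataFrame, I need to conditionally transform the "Hours" and "Wages" fields into the "Gross Hours", "Regular Wages", and "Overtime Wages" fields. I also need to group by "Check Date".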

import pandas as pd

df1 = pd.DataFrame( {
                        "Pay Code" : ["1","4","OCH","3","3"],
                        "Check Date" : ["2019-01-04","2019-01-04","2019-01-04","2019-01-04","2019-01-18"],
                        "Pay Start Date" : ["2018-12-15","2018-12-15","2018-12-15","2018-12-15","2018-12-29"],
                        "Pay End Date" : ["2018-12-28","2018-12-28","2018-12-28","2018-12-28","2019-01-11"],
                        "Pay Code Description" : ["REGULAR PAY","HOLIDAY PAY","ON CALL HOURLY","VACATION PAY","VACATION PAY"],
                        "Hours" : [46.0,16.0,152.0,18.0,19.5],
                        "Wages" : [1226.58,426.64,63.33,479.98,530.38],
                        "Gross Hours" : ["NaN","NaN","NaN","NaN","NaN"],
                        "Regular Wages" : ["NaN","NaN","NaN","NaN","NaN"],
                        "Overtime Wages" : ["NaN","NaN","NaN","NaN","NaN"]
                  } )

Say I have static lists as a reference for deciding which column each value is transformed into.

GrossHours = ['1','2','3']

RegularWages = ['1','3','4']

OvertimeWages = ['2','OCH']
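To make the role of these lists concrete, here is a minimal sketch of how each one can be turned into a boolean mask with `Series.isin`. The names `pay_codes` and `membership` are hypothetical and only used for illustration. It also shows that a single pay code (for example '1' or '3') can feed more than one target column.

```python
import pandas as pd

# The "Pay Code" values from df1 in the question.
pay_codes = pd.Series(["1", "4", "OCH", "3", "3"], name="Pay Code")

GrossHours = ['1', '2', '3']
RegularWages = ['1', '3', '4']
OvertimeWages = ['2', 'OCH']

# One boolean column per target field: True where that row's pay code
# contributes to the field. `membership` is an illustrative name.
membership = pd.DataFrame({
    "Gross Hours": pay_codes.isin(GrossHours),
    "Regular Wages": pay_codes.isin(RegularWages),
    "Overtime Wages": pay_codes.isin(OvertimeWages),
})

print(membership)
```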

The desired result would be this DataFrame:

df_result = pd.DataFrame( {
                        "Check Date" : ["2019-01-04","2019-01-18"],
                        "Pay Start Date" : ["2018-12-15","2018-12-29"],
                        "Pay End Date" : ["2018-12-28","2019-01-11"],
                        "Hours" : [232,19.5],
                        "Wages" : [2196.53,530.38],
                        "Gross Hours" : [64.0,19.5],
                        "Regular Wages" : [2133.2,530.38],
                        "Overtime Wages" : [63.33,"NaN"]
                  } )

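What am I missing? I have tried applying a number of lambda functions to df1, and they give the results I need, but I am not sure how to get those result objects cleanly back into the original DataFrame df1. Is making a bunch of intermediate DataFrames, joining or merging them back to the original, and then doing another group-by the only option?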

g1 = df1.groupby(["Check Date"])

g1.apply(lambda x: x[x['Pay Code'].isin(GrossHours)]['Hours'].astype(float).sum())

Check Date
2019-01-04    64.0
2019-01-18    19.5
dtype: float64

1 Answer:

Answer 0: (score: 0)

First, I built a list of tuples to iterate over.

transformations = [('Gross_Hours', ['1','2','3']), ('Regular_Wages', ['1','3','4']), ('Overtime_Wages', ['2','OCH'])]

I also defined the structure of the output DataFrame I expect.

result_dataframe_fields = ['Check Date', 'Pay Start Date','Pay End Date','Gross Hours', 'Regular Wages', 'Overtime Wages']

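By applying @Datanovice's suggestion to a path similar to the one I had already gone down, I ended up with this, which is about as clear and readable as I could make it.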

# Instantiate result dataframe
df_result = df1.groupby(result_dataframe_fields).sum().reset_index()

for t_ix, t_list in transformations:
    # Gross hours are summed from 'Hours'; both wage columns are summed from 'Wages'
    source_col = 'Hours' if t_ix == 'Gross_Hours' else 'Wages'

    # Create aggregated set to populate result dataframe
    g1 = df1.loc[df1['Pay Code'].isin(t_list)].groupby('Check Date')[source_col].agg(temp_col_name='sum')
    g2 = g1.reset_index()
    g2.columns = ['Check Date', t_ix]

    # Handle the .agg() column naming limitation (no spaces in keyword names): swap underscores for spaces
    colsg2 = g2.columns
    colsg2 = colsg2.map(lambda x: x.replace('_', ' ') if isinstance(x, (str)) else x)
    g2.columns = colsg2

    # Dataframe copy that will update result dataframe
    update_df = g2.copy()

    df_result.update(update_df)
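
One thing worth knowing about the last step: `DataFrame.update` aligns on index labels, not on the 'Check Date' values, so the loop relies on both frames sharing the same default integer index. The sketch below uses made-up frames (`left`, `right`, and the value 10.0 are hypothetical, not from the question) to show what happens when a check date is missing from the aggregated frame, and one possible way to align on the date by making it the index.

```python
import pandas as pd

# Made-up frames for illustration only.
left = pd.DataFrame({
    "Check Date": ["2019-01-04", "2019-01-18"],
    "Overtime Wages": [float("nan"), float("nan")],
})
# Only the second check date has a value; after reset_index it sits at label 0.
right = pd.DataFrame({"Check Date": ["2019-01-18"], "Overtime Wages": [10.0]})

# Index-based alignment: label 0 of `right` overwrites label 0 of `left`,
# so the value lands on the first row and its 'Check Date' is overwritten too.
by_position = left.copy()
by_position.update(right)

# Aligning on the date instead: use 'Check Date' as the index on both sides.
by_date = left.set_index("Check Date")
by_date.update(right.set_index("Check Date"))
by_date = by_date.reset_index()

print(by_position)
print(by_date)
```

With the sample data in the question the positions happen to line up, so the loop gives the expected numbers; the index-based variant is only a safeguard for data where they do not.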

[Image: result from Jupyter Lab]

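I am still hoping this is not the best answer, because my real application is much larger than this, and the approach looks pretty awful at the scale of my actual code.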
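
For what it is worth, here is a sketch of one possible alternative that avoids the loop and the intermediate frames: mask each source column with `Series.where`, then do a single `groupby(...).sum()`. This is not the method from the question or from @Datanovice's suggestion. The `spec` dictionary and the `masked` frame are hypothetical names, and the sketch assumes that "Pay Start Date" and "Pay End Date" are constant within each "Check Date", as they are in the sample data.

```python
import pandas as pd

# Trimmed copy of df1 from the question (only the columns used here).
df1 = pd.DataFrame({
    "Pay Code": ["1", "4", "OCH", "3", "3"],
    "Check Date": ["2019-01-04"] * 4 + ["2019-01-18"],
    "Pay Start Date": ["2018-12-15"] * 4 + ["2018-12-29"],
    "Pay End Date": ["2018-12-28"] * 4 + ["2019-01-11"],
    "Hours": [46.0, 16.0, 152.0, 18.0, 19.5],
    "Wages": [1226.58, 426.64, 63.33, 479.98, 530.38],
})

# Hypothetical spec: target column -> (source column, pay codes that count).
spec = {
    "Gross Hours": ("Hours", ["1", "2", "3"]),
    "Regular Wages": ("Wages", ["1", "3", "4"]),
    "Overtime Wages": ("Wages", ["2", "OCH"]),
}

# Build each target as a masked copy of its source column: rows whose
# pay code is not listed become NaN and are skipped by sum().
masked = df1.assign(**{
    target: df1[source].where(df1["Pay Code"].isin(codes))
    for target, (source, codes) in spec.items()
})

keys = ["Check Date", "Pay Start Date", "Pay End Date"]
value_cols = ["Hours", "Wages"] + list(spec)

# min_count=1 leaves a group with no matching rows as NaN instead of 0.
df_result = masked.groupby(keys)[value_cols].sum(min_count=1).reset_index()

print(df_result)
```

Adding another target column then only means adding one entry to `spec`, which may scale better than a branch per column, though I have not tried it on anything larger than the sample.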