Question

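I have a database of products, each with a product type and a product line (a set of product types). I need to compute the average sales for each product type, which is easy so far: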

df.groupby('Type')['Sales'].mean()

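The problem is that some types have low statistics, for example new products. In those cases, the business wants to use the product-line average instead of the individual product-type average.

So, essentially, I need to build a custom aggregation function that changes its behavior based on the count within the group and, incidentally, needs access to information from the whole database when the statistics are low.

What is the best way to approach this?

I have already tried grouping and looping over the groups. It works, but then I have to put the values back into the table, and I don't know how to do that. The other approach would be to write a custom aggregation function and pass it through .agg, but I don't know how to do that either.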

groups = df.groupby('Type')
for name, group in groups:
    nmachines = group['Machine'].nunique()
    if nmachines < 5:
        ...  # do stuff using df
    else:
        group['Sales'].mean()

Answer 1

You could try apply (which is more flexible than agg in terms of what the function receives):

def your_func(group):
    nmachines = group.Machine.nunique()
    if nmachines < 5:
        stuff = ...  # do stuff using df
        return stuff
    # default is to return the Sales mean
    return group.Sales.mean()

df.groupby('Type').apply(your_func)
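
As a rough illustration of what the elided "do stuff" branch might look like, here is a minimal sketch that falls back to the product-line mean when a type has fewer than 5 distinct machines. The column names follow the ones used elsewhere in this thread; the toy data, the helper name type_or_line_mean, the MIN_MACHINES constant and the "Avg sales" column are made up for the example, and it assumes each type belongs to exactly one product line.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "Type":         ["A"] * 6 + ["B"] * 2,
    "Product Line": ["PL1"] * 8,
    "Machine":      ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8"],
    "Sales":        [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 100.0, 200.0],
})

MIN_MACHINES = 5  # threshold taken from the question

# Product-line averages, computed once on the full table.
line_mean = df.groupby("Product Line")["Sales"].mean()

def type_or_line_mean(group):
    """Mean sales of the type, or of its product line when the type has few machines."""
    if group["Machine"].nunique() < MIN_MACHINES:
        # Assumption: every row of a type shares the same product line.
        return line_mean[group["Product Line"].iloc[0]]
    return group["Sales"].mean()

# Select the needed columns explicitly so the grouping column is not passed to the function.
avg_sales = (
    df.groupby("Type")[["Product Line", "Machine", "Sales"]]
      .apply(type_or_line_mean)
)

# Write the per-type result back onto every row of the original table.
df["Avg sales"] = df["Type"].map(avg_sales)
print(avg_sales)
```

Mapping the resulting Series onto the Type column is one way to get the values back into the table, which was the sticking point in the question.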

Answer 2

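I managed to solve it by looping over the groups. I'm posting my solution here. It works, but it doesn't seem like the most elegant approach. If anyone has a better idea, I'd be happy to hear it. N.B.: the real function is a bit more complicated than this; I've tried to reduce it to the essentials needed to understand it.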

import numpy as np

def getSalesPerMachine(df):

    groups  = df[['Type','Sales','Product Line','Machine']].groupby('Type', as_index=False)

    # Build the output table
    tab = groups.agg({'Machine':'nunique', 'Sales':'sum', 'Product Line' : 'max'})
    tab['Annual sales'] = np.nan  ## <-- Create the column where I'll put the result.

    for name, group in groups:

        ## If stats is low use the full product line (rescaled)
        nmachines = group.Machine.nunique()

        if nmachines < 5 :

            # Retrieve the product line
            pl = group['Product Line'].max()

            ## Get all machines of that product line
            mypl = df.loc[df['Product Line'] == pl]

            ## Assign to sales the total of the PL, rescaled by how many machines the specific type has
            sales = mypl.Sales.sum() * nmachines / mypl.Machine.nunique()

        else :
            # If high stats just return the sum plain and simple
            sales = group.Sales.sum() 

        # Save result (this was where I was stuck before)
        tab['Annual sales'] = \
            np.where(tab['Type']==name, sales, tab['Annual sales'])

    return tab
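
For what it's worth, the same logic can probably be expressed without the explicit loop. The sketch below is one possible loop-free variant, not the author's code: it computes the per-type and per-product-line totals separately, merges them, and picks between the two with np.where. The function name sales_per_machine_vectorized, the helper columns line_sales and line_machines, and the toy data are all invented for the example.

```python
import numpy as np
import pandas as pd

def sales_per_machine_vectorized(df, min_machines=5):
    """Per-type sales sum, or the rescaled product-line sum for types with few machines."""
    # Per-type statistics, one row per Type (same aggregation as the loop version).
    tab = (
        df.groupby("Type", as_index=False)
          .agg({"Machine": "nunique", "Sales": "sum", "Product Line": "max"})
    )
    # Per-product-line totals, one row per Product Line.
    line = (
        df.groupby("Product Line", as_index=False)
          .agg(line_sales=("Sales", "sum"), line_machines=("Machine", "nunique"))
    )
    tab = tab.merge(line, on="Product Line", how="left")

    # Product-line total rescaled by the type's share of machines.
    rescaled = tab["line_sales"] * tab["Machine"] / tab["line_machines"]
    tab["Annual sales"] = np.where(tab["Machine"] < min_machines, rescaled, tab["Sales"])
    return tab.drop(columns=["line_sales", "line_machines"])

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "Type":         ["A"] * 6 + ["B"] * 2,
    "Product Line": ["PL1"] * 8,
    "Machine":      ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8"],
    "Sales":        [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 100.0, 200.0],
})
result = sales_per_machine_vectorized(df)
print(result)
```

Here type A has 6 machines and keeps its own sum, while type B has only 2 and gets the product-line total scaled by 2/8.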

Custom grouping in pandas based on the count in a column
