Question

Let's assume we have this data:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'user_id'         : [1,  1,    2,  2, 1, 3,   1 ],
     'purchase_id'     : [3,  2,    3,  1, 1, 2,   3 ],
     'purchase_amount' : [10, 0.50, 10, 1, 1, 0.50,10]}
)

I have a custom function that I want to apply, and it works:

def m(x):

    len(x)
    x = np.mean(x ** 2)
    return(x)

print(df['purchase_amount'].aggregate(m))
#> 43.214285714285715

However, when I remove the (seemingly irrelevant) len() statement, the code no longer returns a single aggregated value:

def m(x):

    # len(x) 
    x = np.mean(x ** 2)
    return(x)

print(df['purchase_amount'].aggregate(m))
#> 0    10.0
#> 1     0.5
#> 2    10.0
#> 3     1.0
#> 4     1.0
#> 5     0.5
#> 6    10.0
#> Name: purchase_amount, dtype: float64

If I replace # len(x) with some non-comment statement (for example 1), I get the same unexpected result.

This is really unexpected to me. What am I missing? I am running pandas 0.24.1 on Windows.

Answer 1

TL;DR: To get the desired output, you can simply run print(np.mean(df['purchase_amount'] ** 2))
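As a minimal sketch of that one-liner, assuming the df from the question, the mean of squares can be computed directly on the column without going through aggregate at all. The names mean_of_squares and same_result are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame(
    {'user_id'         : [1,  1,    2,  2, 1, 3,   1 ],
     'purchase_id'     : [3,  2,    3,  1, 1, 2,   3 ],
     'purchase_amount' : [10, 0.50, 10, 1, 1, 0.50,10]}
)

# Square every amount, then average the squares in one vectorized step.
mean_of_squares = np.mean(df['purchase_amount'] ** 2)

# Equivalent spelling using the Series method instead of the NumPy function.
same_result = (df['purchase_amount'] ** 2).mean()

print(mean_of_squares)
```

Because the function is called on the whole column explicitly, the result does not depend on how aggregate decides to dispatch a callable.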

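The Series.aggregate docs say:

func : function, str, list or dict. Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.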

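With len(x) in place, the first call to m raises an exception (because x is a float, and float objects have no len). That exception makes pandas fall back and call m again, this time passing it the entire Series (as documented). Without len(x), the element-wise calls succeed, so no fallback happens and you get one value per element instead of a single aggregate.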
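The fallback described above can be imitated without pandas. The following is only a simplified model, not pandas code: aggregate_sketch, m_with_len and m_without_len are hypothetical names, and the real dispatch logic differs between pandas versions.

```python
import numpy as np

def aggregate_sketch(values, func):
    """Hypothetical stand-in for the fallback; not the pandas implementation."""
    try:
        # First attempt: call func once per element, so each x is a scalar.
        return np.array([func(x) for x in values])
    except TypeError:
        # Fallback: call func a single time with the whole array.
        return func(values)

def m_with_len(x):
    len(x)  # raises TypeError when x is a scalar float
    return np.mean(x ** 2)

def m_without_len(x):
    return np.mean(x ** 2)  # works on a scalar, so nothing is raised

amounts = np.array([10, 0.50, 10, 1, 1, 0.50, 10])

with_len = aggregate_sketch(amounts, m_with_len)        # a single number
without_len = aggregate_sketch(amounts, m_without_len)  # one value per element

print(with_len)
print(without_len.shape)
```

In this model the len(x) call matters only because it raises on a scalar, which is what pushes the function onto the whole-array path.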

If we look at the Series.aggregate source, we can see this behavior:

...
result = None
if axis == 0:
    try:
        result, how = self._aggregate(func, axis=0, *args, **kwargs)
    except TypeError:
        pass
if result is None:
    return self.apply(func, axis=axis, args=args, **kwargs)
return result

Weird behavior with custom function and pandas aggregate