Question

I am using the following lines of code to compute conditional probabilities:

    variable = 'variable_name'
    probs = df.groupby(variable).size().div(len(df))
    cond_probs = df.groupby([variable, 'has_income']).size().div(len(df)).div(probs, axis=0, level=variable)

They produce the following output:

    variable_name         has_income
    (0.999, 2.0]          False          0.756323
                          True           0.243677
    (2.0, 3.0]            False          0.798372
                          True           0.201628
    (3.0, 16.0]           False          0.809635
                          True           0.190365

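I would like to add an extra column to the output holding the sample size of each group, but I cannot rewrite the formula inside a lambda function, because the group object passed to the lambda does not have the same methods as the object returned by df.groupby(). Example: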

    cond_probs =df.groupby([variable, 'has_income']).apply(lambda x: 
    pd.Series({
        'probs': x.size().div(len(df)).div(probs, axis=0, level=variable),
        'size': x.size()
    }))

Error: TypeError: 'numpy.int32' object is not callable

Is there another way to get these results cleanly, without computing two groupbys and joining the DataFrames at the end?

Answer 1

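When you use apply with groupby, the function does not receive a GroupBy object; it receives the slice of the DataFrame that corresponds to the current group. So in your case x is a DataFrame, not a GroupBy object. Treat it just as you would treat df.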

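    # x is the sub-DataFrame for one (variable, has_income) group
    cond_probs = df.groupby([variable, 'has_income']).apply(lambda x:
        pd.Series({
            # share of all rows in this group, divided by P(variable)
            'probs': (len(x) / len(df)) / probs[x.iloc[0][variable]],
            'size': len(x)
        })
    )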
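
As an alternative to apply, the group sizes can be computed once and the conditional probabilities derived from them. The following is only a sketch under assumptions: the toy DataFrame is hypothetical, and the names `sizes`, `totals` and `result` are introduced here for illustration. It relies on the fact that (len(x) / len(df)) / probs simplifies to the cell count divided by the row count for that value of the variable.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the question.
df = pd.DataFrame({
    "variable_name": ["a", "a", "a", "b", "b"],
    "has_income": [True, False, False, True, True],
})
variable = "variable_name"

# One groupby: number of rows in each (variable, has_income) cell.
sizes = df.groupby([variable, "has_income"]).size()

# Rows per value of `variable`, broadcast back onto the two-level index.
totals = sizes.groupby(level=variable).transform("sum")

# P(has_income | variable) = cell count / row count for that value of `variable`.
result = pd.DataFrame({"probs": sizes / totals, "size": sizes})
print(result)
```

The second groupby here runs on the small Series of counts rather than on df, and no join is needed because both columns share the same index.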

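NB: using .size on a DataFrame returns the total number of cells (rows × columns), so it is not the same as GroupBy.size (docs). It is also a property rather than a method, which is why x.size() raises the TypeError shown in the question.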
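
The difference can be checked on a small example. This is an illustrative sketch with hypothetical data, not part of the original answer; the variable names `n_cells`, `group_sizes` and `error` are made up for the demonstration.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the question.
df = pd.DataFrame({
    "variable_name": ["a", "a", "b"],
    "has_income": [True, False, True],
})

# DataFrame.size is a property: rows * columns, returned as an integer.
n_cells = df.size  # 3 rows * 2 columns

# GroupBy.size() is a method: it returns the row count of each group.
group_sizes = df.groupby("variable_name").size()

# Calling the property like a method fails, as in the question.
try:
    df.size()
    error = None
except TypeError as exc:
    error = exc

print(n_cells, group_sizes.to_dict(), type(error).__name__)
```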
