Question

Please help, I can't understand what data.groupby(cuts).outcome.agg does in pandas. Here is the code context:

    #group outcomes into bins of similar probability
    bins = np.linspace(0, 1, 20)
    cuts = pd.cut(prob, bins)
    print(cuts)
    binwidth = bins[1] - bins[0]

    #freshness ratio and number of examples in each bin
    cal = data.groupby(cuts).outcome.agg(['mean', 'count'])
    print(cal['count'])
    print(cal['mean'])
    cal['pmid'] = (bins[:-1] + bins[1:]) / 2
    cal['sig'] = np.sqrt(cal.pmid * (1 - cal.pmid) / cal['count'])

    #the calibration plot
    ax = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
    p = plt.errorbar(cal.pmid, cal['mean'], cal['sig'])
    plt.plot(cal.pmid, cal.pmid, linestyle='--', lw=1, color='k')
    plt.ylabel("Empirical Fraction")

Answer 1

data is a DataFrame that contains a column named outcome. The salient part of the code is:

    cal = data.groupby(cuts).outcome.agg(['mean', 'count'])

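In order, this does the following:

1. Groups the rows of data by the entries in cuts, the bin labels produced by pd.cut.
2. Selects the outcome column, which gives a SeriesGroupBy.
3. Creates a DataFrame with two columns, 'mean' and 'count', computed for each group in the SeriesGroupBy.
4. Assigns that DataFrame to the variable cal.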
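
The steps above can be sketched on a toy example. The values of prob and outcome below are made up for illustration, and only two bins are used instead of the 19 in the question so the result is easy to check by hand. Passing observed=False is my own addition: it keeps empty bins in the result and avoids a deprecation warning in recent pandas versions.

    import numpy as np
    import pandas as pd

    # Hypothetical data: predicted probabilities and the 0/1 outcomes they belong to.
    prob = pd.Series([0.05, 0.10, 0.30, 0.40, 0.60, 0.90])
    data = pd.DataFrame({"outcome": [0, 0, 1, 0, 1, 1]})

    # Two equal-width bins: (0.0, 0.5] and (0.5, 1.0].
    bins = np.linspace(0, 1, 3)
    cuts = pd.cut(prob, bins)

    # Step 1: group the rows of data by the bin each probability falls into.
    grouped = data.groupby(cuts, observed=False)

    # Step 2: pick the outcome column; this is a SeriesGroupBy.
    outcome_by_bin = grouped.outcome

    # Steps 3 and 4: one row per bin, with the mean and the row count of outcome.
    cal = outcome_by_bin.agg(["mean", "count"])
    print(cal)

Here the first bin holds four rows with outcomes 0, 0, 1, 0, so its mean is 0.25 and its count is 4; the second bin holds two rows with outcomes 1, 1, so its mean is 1.0 and its count is 2. Because outcome is 0 or 1, the mean is the fraction of positive outcomes in each bin, which is what the calibration plot in the question compares against the bin midpoints.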
