Question

Consider the following simple example. I am interested in getting a categorical variable whose categories correspond to quantiles.

import pandas as pd

df = pd.DataFrame({'A': 'foo foo foo bar bar bar'.split(),
                   'B': [0, 0, 1] * 2})

df
Out[67]: 
     A  B
0  foo  0
1  foo  0
2  foo  1
3  bar  0
4  bar  0
5  bar  1

In Pandas, qcut does this job. Unfortunately, qcut fails here because of the ties in the data.

df['C'] = df.groupby(['A'])['B'].transform(
                     lambda x: pd.qcut(x, 3, labels=range(1,4)))

gives the classic ValueError: Bin edges must be unique: array([ 0. , 0. , 0.33333333, 1. ])
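To see where the duplicate edge comes from, the bin edges can be reproduced by hand. This is only a minimal sketch, assuming the edges are linearly interpolated quantiles (NumPy's default for `np.quantile`); it is not taken from the pandas source.

```python
import numpy as np

# Column B within a single group ('foo' or 'bar')
x = np.array([0, 0, 1])

# Edges for 3 equal-probability bins: the 0, 1/3, 2/3 and 1 quantiles
edges = np.quantile(x, [0, 1/3, 2/3, 1])
print(edges)

# Because two of the three values are tied at 0, the first two edges coincide
has_duplicates = len(np.unique(edges)) < len(edges)
print(has_duplicates)
```

The first two edges are both 0, so the first bin would be empty, which is what qcut refuses to accept.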

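Is there another robust solution (from any other Python package) that does not require reinventing the wheel?

There has to be one. I don't want to code my own quantile binning function. Any decent statistics package (SAS, Stata, etc.) can handle ties when creating quantile bins.

I want something robust that is based on sound methodological choices.

For example, see the SAS solution here: https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146840.htm

Or here is Stata's well-known xtile (http://www.stata.com/manuals13/dpctile.pdf). Note also this post: Definitive way to match Stata weighted xtile command using Python?

What am I missing? Maybe using SciPy?

Many thanks!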

Answer 1

IIUC, you can use numpy.digitize (with numpy imported as np):

df['C'] = df.groupby(['A'])['B'].transform(lambda x: np.digitize(x,bins=np.array([0,1,2])))

     A  B  C
0  foo  0  1
1  foo  0  1
2  foo  1  2
3  bar  0  1
4  bar  0  1
5  bar  1  2
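Note that np.digitize with bins=[0, 1, 2] uses fixed edges rather than quantiles computed from the data. If data-driven quantile bins are wanted, one commonly used workaround is to rank the values first so that ties are broken by order of appearance, then pass the ranks to qcut. The sketch below is an assumption about what might be acceptable here, not an equivalent of SAS or Stata: it splits tied values across bins, which may or may not be desirable. The helper name `tercile_by_rank` and the column name `C_rank` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': 'foo foo foo bar bar bar'.split(),
                   'B': [0, 0, 1] * 2})

def tercile_by_rank(x):
    # method='first' gives every row a distinct rank, so ties cannot
    # produce duplicate bin edges (tied values may land in different bins)
    ranks = x.rank(method='first')
    # labels=False returns 0-based bin codes; shift them to 1..3
    return pd.qcut(ranks, 3, labels=False) + 1

df['C_rank'] = df.groupby('A')['B'].transform(tercile_by_rank)
print(df)
```

If tied values should instead stay in the same bin, qcut also accepts duplicates='drop', which merges the repeated edges and therefore yields fewer than 3 bins for this data; in that case the number of labels has to match the reduced number of bins (or labels=False can be used).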
