Question

I have a dataframe that is the output of a groupby on a categorical variable created with pd.cut.

import pandas as pd
import numpy as np

# 10,000 random earnings values; every row counts once
di = pd.DataFrame({'earnings': np.random.choice(10000, 10000), 'counts': [1] * 10000})
# Bin edges 0, 500, ..., 5000, plus one very wide top bracket
brackets = np.append(np.arange(0, 5001, 500), 100000000)
di['earncat'] = pd.cut(di['earnings'], brackets, right=False, retbins=True)[0]

di_everyone = di.groupby('earncat').sum()[['counts']]
di_everyone.sort_index(inplace=True)
di_everyone.to_string()

This is the output:

[0, 500)          83,005,823
[1000, 1500)      11,995,255
[1500, 2000)      13,943,052
[2000, 2500)      11,967,696
[2500, 3000)      10,741,178
[3000, 3500)       9,749,914
[3500, 4000)       6,833,928
[4000, 4500)       7,150,125
[4500, 5000)       4,655,773
[500, 1000)        9,718,753
[5000, 100000000) 26,588,622

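I am not sure why [500, 1000) appears in the second-to-last row. I decided not to label earncat because I want to see the breakdown by bracket. How can I sort on earncat?

Thanks in advance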

Answer 1

You are probably using pandas 0.15.x, which does not support this kind of operation on categorical dtypes (which is what pd.cut produces).
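The order shown in the question is exactly what plain string sorting of the bracket labels would give, which suggests (the answer does not confirm this) that the labels were being compared as text rather than as numeric intervals. A minimal sketch in pure Python, where `edges`, `labels` and `lower_bound` are names made up for illustration:

```python
# Rebuild the bracket labels from the question as plain strings.
edges = list(range(0, 5001, 500)) + [100000000]
labels = [f"[{lo}, {hi})" for lo, hi in zip(edges, edges[1:])]

# Text comparison works character by character, so "[500, ..." lands
# after "[4500, ..." because "5" sorts after "4".
string_order = sorted(labels)

def lower_bound(label):
    """Parse the lower bound out of a label such as '[500, 1000)'."""
    return int(label[1:].split(",")[0])

# Sorting on the parsed number restores the intended bracket order.
numeric_order = sorted(labels, key=lower_bound)

print(string_order)
print(numeric_order)
```

In `string_order`, "[500, 1000)" ends up second to last, just as in the question's output.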

In the meantime, you can work around it:

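# Relies on the category labels being strings such as '[500, 1000)'
# Lower bound: drop the leading '[' and take the text before the comma
di['earnlower'] = di['earncat'].apply(lambda x: int(x[1:].split(',')[0]))
# Upper bound: drop only the trailing ')' and take the text after the comma
di['earnhigher'] = di['earncat'].apply(lambda x: int(x[:-1].split(',')[1]))

# Grouping on the integer bounds makes the result sort numerically
di_everyone = di.groupby(['earnlower', 'earnhigher']).sum()[['counts']]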
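For comparison, here is a sketch of the same grouping under a more recent pandas release. It assumes a version in which pd.cut returns an ordered categorical of interval objects and groupby accepts the `observed` keyword (roughly 0.23 or later); this is not what the answer above was written against. The seeded random generator is only there to make the sketch reproducible.

```python
import numpy as np
import pandas as pd

# Assumption: a recent pandas; the seed is arbitrary.
rng = np.random.default_rng(0)
di = pd.DataFrame({"earnings": rng.integers(0, 10000, size=10000), "counts": 1})

brackets = np.append(np.arange(0, 5001, 500), 100000000)
di["earncat"] = pd.cut(di["earnings"], brackets, right=False)

# observed=False keeps every bracket in the result, even an empty one.
di_everyone = di.groupby("earncat", observed=False)[["counts"]].sum()

# The index holds interval objects, so the lower bounds can be read directly.
lower_bounds = [interval.left for interval in di_everyone.index]
print(di_everyone)
```

Under those assumptions the rows come out in bracket order without parsing the labels, because the groupby sorts on the category order rather than on the label text.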

pandas: how to sort the result of a groupby on a categorical variable created with pd.cut
