Question

My DataFrame contains group-level ranks:

       foo  rank
year            
2000    10   340
2000  7010   134
2000  7000   135
2000  6940    83
2000  6840    82
2000  6830    19
2000  6820    81
2000  6800   162
2000  6765   161
2000  7020   136

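I wrote a function that clusters the ranks into an arbitrary number n of groups. For n=2, this amounts to grouping the top 50% of ranks together and the bottom 50% together: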

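import numpy as np

def createRankGroups(group, n):
    # Split the ranks of one group into n equal-width bins of the rank range.
    maxRank = group['rank'].max()
    group['group'] = np.nan
    for i in range(1, n + 1):
        # Upper rank boundary of bin i.
        upperRankBoundary = maxRank / n * i
        # Rows within the boundary that have not been assigned yet.
        idx = (group['rank'] <= upperRankBoundary) & group.group.isnull()
        group.loc[idx, 'group'] = i
    return group['group']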
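
To make the boundary rule concrete, here is a small worked sketch for the sample ranks with n=2. It uses plain Python lists instead of the DataFrame, and the names `ranks`, `boundaries` and `groups` are only illustrative:

```python
# Ranks taken from the sample data shown in the question.
ranks = [340, 134, 135, 83, 82, 19, 81, 162, 161, 136]
n = 2
max_rank = max(ranks)

# Upper rank boundary of each group i = 1..n, computed as in createRankGroups.
boundaries = [max_rank / n * i for i in range(1, n + 1)]

# Each rank goes to the first group whose upper boundary it does not exceed.
groups = [
    next(i for i, b in enumerate(boundaries, start=1) if r <= b)
    for r in ranks
]
```

With these numbers only rank 340 lies above the first boundary, so it is the only row placed in group 2, which matches the output shown further down.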

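The problem is that when I run this function through apply, I get an unwanted extra index level, which breaks merging the result back into the DataFrame.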

df['group'] = df.groupby(level=0).apply(lambda x: createRankGroups(x, 2))
Exception: cannot handle a non-unique multi-index!

Here is why:

In[42]: df.groupby(level=0).apply(lambda x: createRankGroups(x, 2)).head()
Out[42]: 
year  year
2000  2000    2
      2000    1
      2000    1
      2000    1
      2000    1

I thought this might be because the index is not unique (since I was not including foo), so I also tried:

In[43]: df = df.reset_index().set_index(['year', 'foo'])
In[44]: df.groupby(level=0).apply(lambda x: createRankGroups(x, 2)).head()
Out[44]: 
year  year  foo 
2000  2000  10      2
            7010    1
            7000    1
            6940    1
            6840    1

Finally, forcing the index to be sorted with df.sort_index(level=0, inplace=True) did not solve the problem either. What should I do?
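
One direction that may be worth trying, offered as a sketch and not a confirmed fix: `transform` on the `rank` column returns one value per original row in the original row order, so no group key is prepended to the index. The helper `rank_groups` below is a hypothetical rework of createRankGroups that returns a plain NumPy array, and the 2001 rows are made up for illustration:

```python
import numpy as np
import pandas as pd

def rank_groups(rank, n):
    # Hypothetical rework of createRankGroups: takes the 'rank' Series of one
    # group and returns a NumPy array, so no index is carried back.
    values = rank.to_numpy()
    max_rank = values.max()
    out = np.full(len(values), np.nan)
    for i in range(1, n + 1):
        upper = max_rank / n * i
        # Rows within the boundary that have not been assigned yet.
        mask = (values <= upper) & np.isnan(out)
        out[mask] = i
    return out

# Frame shaped like the one in the question (non-unique 'year' index).
# The 2000 rows come from the question; the 2001 rows are invented.
df = pd.DataFrame(
    {"foo": [10, 7010, 7000, 1, 2], "rank": [340, 134, 135, 10, 4]},
    index=pd.Index([2000, 2000, 2000, 2001, 2001], name="year"),
)

# transform keeps the original row order and does not add an index level.
result = df.groupby(level=0)["rank"].transform(lambda r: rank_groups(r, 2))

# Assign by position to sidestep alignment on the non-unique index.
df["group"] = result.to_numpy()
```

Assigning the raw values is a deliberate choice here: it avoids any index alignment, which is where the non-unique multi-index error comes from.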

Groupby gives me an extra index level

0 answers: