
Question

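Say I have a dataframe with columns A, B, C and data.

I would like to:

Convert it to have a MultiIndex on A, B and C
Sort the rows by the index levels A and B of this dataframe
Within each A B pair of the index, sort the rows (i.e. the C index) by the values in the data column
Obtain the top 20 rows within each such A B pair, according to the previous sorting on data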

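This shouldn't be hard, but I have tried all sorts of approaches and none of them gives me what I want. The following, for example, is close, but it only gives me values for the first group of A B indices.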

temp = mdf.set_index(['A', 'B','C']).sort_index()

# Sorting by value and retrieving the top 20 entries:
func = lambda x: x.sort('data', ascending=False).head(20)
temp = temp.groupby(level=['A','B'],as_index=False).apply(func)

# Drop the dummy index (?) introduced in the line above
temp = temp.reset_index(level=0)['data']

Update

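import numpy as np
import pandas as pd
from numpy.random import randn

# Builds a 500-row frame with random integer keys A, B, C (1 to 10) and normally distributed data
def create_random_multi_index():
  df = pd.DataFrame({'A' : [np.random.random_integers(10) for x in xrange(500)], 
                     'B' : [np.random.random_integers(10) for x in xrange(500)], 
                     'C' : [np.random.random_integers(10) for x in xrange(500)],
                     'data' : randn(500) })

  return df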

E.g., this is what I would like (showing the top 3 elements; note how the data is sorted within each A-B pair):

             data
A B  C           
1 1  10  2.057864
     5   1.234252
     7   0.235246
  2  7   1.309126
     6   0.450208
     8   0.397360
2 2  2   1.609126
     1   0.250208
     4   0.597360
...

Answer 1

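Not sure I 100% understand what you want, but I think this will do it. When you reset the index, the rows stay in the same order. The key is sortlevel(), which sorts the levels lexicographically (and uses the remaining levels to break ties). In 0.14 (coming soon) there is an option, sort_remaining, which I think you can use to control this.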
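For readers on more recent pandas releases, where (as far as I know) `sort_index()` has taken over the role of `sortlevel()` on DataFrames, the effect of `sort_remaining` can be sketched roughly as follows. The toy frame is hypothetical and is not the data from the question.

```python
import pandas as pd

# Hypothetical toy frame, only for illustrating the option.
df = pd.DataFrame({
    "A": [2, 1, 1, 1],
    "B": [1, 2, 1, 1],
    "C": [5, 3, 9, 4],
    "data": [0.1, 0.2, 0.3, 0.4],
}).set_index(["A", "B", "C"])

# Default behaviour: sort by A, then break ties with the remaining levels B and C.
full = df.sort_index()

# Sort by A and B only; sort_remaining=False leaves C out of the sort key.
partial = df.sort_index(level=["A", "B"], sort_remaining=False)

print(full)
print(partial)
```

With `sort_remaining=False`, level C no longer takes part in the ordering, which is what matters when the rows inside each A B pair should be ordered by data instead.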

In [48]: np.random.seed(1234)

In [49]:  df = pd.DataFrame({'A' : [np.random.random_integers(10) for x in xrange(500)], 
   ....:                      'B' : [np.random.random_integers(10) for x in xrange(500)], 
   ....:                      'C' : [np.random.random_integers(10) for x in xrange(500)],
   ....:                      'data' : randn(500) })

Set the index first, then sort it and reset.

Then group by A, B and pull out the 20 largest elements.

df.set_index(['A','B','C']).sortlevel().reset_index().groupby(
             ['A','B']).apply(lambda x: x.sort(columns='data',ascending=False).head(20)).set_index(['A','B','C'])
Out[8]: 
             data
A B  C           
1 1  1   0.959688
     2   0.918230
     2   0.731919
     10  0.212463
     1   0.103644
     1  -0.035266
  2  8   1.459579
     8   1.277935
     5  -0.075886
     2  -0.684101
     3  -0.928110
  3  5   0.675987
     4   0.065301
     5  -0.800067
     7  -1.349503
  4  4   1.167308
     8   1.148327
     9   0.417590
     6  -1.274146
     10 -2.656304
  5  2  -0.962994
     1  -0.982679
  6  2   1.410920
     6   1.352527
     10  0.510330
     4   0.033275
     1  -0.679686
     10 -0.896797
     1  -2.858669
  7  8  -0.219342
     8  -0.591054
     2  -0.773227
     1  -0.781850
     3  -1.259089
     10 -1.387992
     10 -1.891734
  8  7   1.578855
     2  -0.498898
  9  3   0.644277
     8   0.572177
     2   0.058431
     9  -0.146912
     4  -0.334690
  10 9   0.795346
     8  -0.137661
     10 -1.335385
2 1  9   1.309405
     3   0.328546
     5   0.198422
     1  -0.561974
     3  -0.578069
  2  5   0.645426
     1  -0.138808
     5  -0.400199
     5  -0.513738
     10 -0.667343
     9  -1.983470
  3  3   1.210882
     6   0.894201
     3   0.743652
              ...

[500 rows x 1 columns]
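The calls used above (`sortlevel()`, `DataFrame.sort()`, `xrange`) come from older pandas and Python 2. A minimal sketch of the same idea for current versions is given below. It swaps the `apply` with a lambda for `sort_values()` followed by `groupby().head()`, which is a different route to the same result, and it uses NumPy's `default_rng` with an assumed seed, so the numbers will not match the output shown above.

```python
import numpy as np
import pandas as pd

# Assumed seed and generator; values differ from the answer's output.
rng = np.random.default_rng(1234)
df = pd.DataFrame({
    "A": rng.integers(1, 11, size=500),   # integers 1..10
    "B": rng.integers(1, 11, size=500),
    "C": rng.integers(1, 11, size=500),
    "data": rng.standard_normal(500),
})

top = (
    # Order by A and B ascending, then by data descending inside each pair.
    df.sort_values(["A", "B", "data"], ascending=[True, True, False])
      # head() keeps at most the first 20 rows of every (A, B) group,
      # in the order established by the sort above.
      .groupby(["A", "B"])
      .head(20)
      # Finally move A, B and C into the index.
      .set_index(["A", "B", "C"])
)

print(top)
```

With 500 rows spread over up to 100 A B pairs, most groups hold fewer than 20 rows, so `head(20)` will usually return every row here, as in the answer's output.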

Answer 2

Try this

df.sort('data', ascending=False).set_index('C').groupby(['A', 'B']).data.head(3)

It's not the most readable syntax, but it gets the job done

A  B  C
1  1  9     1.380526
      1     0.903524
      7    -0.112363
   2  2     0.284057
      5     0.131392
      1     0.111512
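`DataFrame.sort()` is not available in current pandas, so the one-liner above will not run there as written. One possible alternative, which is not what this answer uses, is `nlargest()` on the grouped Series. The sketch below assumes a small hypothetical frame and passes `group_keys=False` so that the group keys are not added to the index a second time.

```python
import pandas as pd

# Hypothetical toy data, not taken from the question.
df = pd.DataFrame({
    "A": [1, 1, 1, 1, 2, 2],
    "B": [1, 1, 1, 1, 2, 2],
    "C": [9, 1, 7, 3, 2, 5],
    "data": [1.4, 0.9, -0.1, -2.0, 0.1, 0.3],
})

top3 = (
    df.set_index(["A", "B", "C"])["data"]
      # Group on the A and B index levels; groups come back in sorted key order.
      .groupby(level=["A", "B"], group_keys=False)
      # Keep the 3 largest data values of each group, largest first.
      .nlargest(3)
)

print(top3)
```

This selects and orders by value in one step, at the cost of working on a single column rather than the whole frame.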

Combining index and value sorting with top-K selection
