Question

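With hierarchically indexed data, is there an easy way to select a range of values? Every method I have seen, including xs and .loc, seems to be limited to a single value; see "Benefits of panda's multiindex?". Using this example data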

from pandas import *
from numpy.random import randn, randint
import itertools as it

M = 100 # Number of rows to generate

# Create some test data with multiindex
df = DataFrame(randn(M, 10))
df.index = [randint(4, size=M), randint(8, size=M)]
# rename() returns a new index, so the result has to be assigned back
df.index = df.index.rename(['a', 'b'])

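I would like to select everything where the first index level is 1 or 2 and the second index level is 3 or 4. The closest I have come is passing a list of tuples to .loc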

# Now extract a subset
part = df.loc[[(1, 3), (1,4), (2,3), (2,4)]]

But this produces some strange behavior,

# The old indices are still shown for some reason
print(part.index.levels)

# Good indexing
print("correct:\n", part.loc[(1, 1)])
# No keyerror, although the key wasn't included
print("wrong:\n", part.loc[[(0, 3)]])   
# Indexing of first index, and then a column, very odd
print("odd:\n", part.loc[(1, 9)])
# But there is an error accessing the original this way
print("Expected error:\n", df.loc[(1, 9)])

Output:

In [436]: [[0, 1, 2, 3], [0, 1, 2, 3, 4, 5, 6, 7]]
correct:
             0         1         2         3         4         5         6  \
1 3 -0.183667  0.578867 -0.944514  0.026295  0.778354  0.603845  0.636486   
  3 -0.337596  0.018084 -0.654721 -1.121475 -0.561706  0.695095 -0.512936   
  3 -0.670779 -0.425093  1.262278 -1.806815  0.855900 -0.230683 -0.225658   
  3 -0.274808 -0.529901  1.265333  0.559646 -1.418687  0.492577  0.141648   

            7         8         9  
1 3  1.109179 -1.569236 -0.617408  
  3 -0.659310  1.249105  0.032657  
  3  0.315601  1.100192 -0.389736  
  3 -0.267462 -0.025189  0.069047  
odd:
 3   -0.617408
3    0.032657
3   -0.389736
3    0.069047
4    0.217577
4   -0.232357
Name: 9, dtype: float64
wrong:
       0   1   2   3   4   5   6   7   8   9
0 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
(truncated)

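So is there a better way than a list of tuples to access multiple parts of a hierarchical index? If not, is there a way to clean up the result after indexing with tuples, so that it raises a sensible error instead of returning NaN?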

Answer 1

You can use pd.IndexSlice to get more human-readable slicing

In [52]: idx = pd.IndexSlice

In [53]: dfmi.loc[idx[:,:,['C1','C3']],idx[:,'foo']]
Out[53]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
...          ...  ...
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]
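The snippet above does not define dfmi. A minimal sketch that rebuilds a frame of the same shape, assuming dfmi is constructed the way the slicers section of the pandas documentation does it (the helper name mklbl and the label scheme are assumptions, not part of this answer):

```python
import numpy as np
import pandas as pd


def mklbl(prefix, n):
    # Assumed helper: build labels such as A0, A1, ..., A(n-1)
    return [f"{prefix}{i}" for i in range(n)]


# Four row levels (A, B, C, D) and two column levels (lvl0, lvl1)
miindex = pd.MultiIndex.from_product(
    [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
)
micolumns = pd.MultiIndex.from_tuples(
    [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")],
    names=["lvl0", "lvl1"],
)

# Sort both axes so that slicers can be used on them
dfmi = (
    pd.DataFrame(
        np.arange(len(miindex) * len(micolumns)).reshape(
            len(miindex), len(micolumns)
        ),
        index=miindex,
        columns=micolumns,
    )
    .sort_index()
    .sort_index(axis=1)
)

idx = pd.IndexSlice
# Rows: any A, any B, C1 or C3 (D left unrestricted); columns: any lvl0, lvl1 == 'foo'
result = dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
print(result)
```

Each position inside idx[...] corresponds to one index level, so a bare `:` means "take everything at this level" and a list restricts the level to those labels.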

See http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers
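Applied to the data from the question, the same idea might look like the sketch below. The fixed seed, the explicit MultiIndex.from_arrays construction, and the sort_index() call are assumptions added here for reproducibility and because slicers generally expect a sorted index. The last step uses remove_unused_levels() to drop the stale level values that the question noticed in part.index.levels.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
M = 100  # number of rows to generate

# Same shape as the question's data: level 'a' in 0..3, level 'b' in 0..7
index = pd.MultiIndex.from_arrays(
    [rng.integers(4, size=M), rng.integers(8, size=M)], names=["a", "b"]
)
df = pd.DataFrame(rng.standard_normal((M, 10)), index=index).sort_index()

idx = pd.IndexSlice
# First level 1 or 2, second level 3 or 4, all columns
part = df.loc[idx[[1, 2], [3, 4]], :]

# Drop level values that no longer occur in the selection
part.index = part.index.remove_unused_levels()
print(part.index.levels)
```

With the unused levels removed, part.index.levels only lists values that actually occur in the selection, which makes lookups for keys outside the subset easier to reason about.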
