Question

I have Python 3.6 and Pandas 19.0. I am using a MultiIndex for my DataFrame.

import numpy as np
import pandas as pd

iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two', 'three', 'four']]
pd.MultiIndex.from_product(iterables, names=['first', 'second'])
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux', 'bar', 'foo']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two', 'three', 'four'])]
s = pd.DataFrame(np.random.randn(10), index=arrays)

I know how to get a subset based on a single value of one of the index levels, e.g.

s.loc[s.index.get_level_values(0)=='bar']
Out[16]: 
                  0
bar one    1.409395
    two    0.837486
    three  1.290018

How can I get a subset based on a set of values for a single index level? Obviously, the following syntax does not work:

my_subset = set(['three', 'one'])
s.loc[s.index.get_level_values(1) in my_subset]

Edit:

What is the fastest solution for a large DataFrame?

Answer 1

Use Index.isin, selecting the second level by its position, 1:

my_subset = set(['three', 'one'])
a = s.loc[s.index.get_level_values(1).isin(my_subset)]
print(a)

                  0
bar one   -0.372206
baz one    0.886271
foo one   -2.231380
qux one    0.960636
bar three  1.272873
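
For a reproducible version of the above, here is a minimal sketch that swaps the random data for fixed integers (the column name value and the integer data are assumptions made for illustration, not part of the question). It also shows why the plain in operator from the question fails: in asks whether the whole Index object is one element of the set, which requires hashing the Index, whereas isin tests each label separately.

```python
import numpy as np
import pandas as pd

arrays = [
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux', 'bar', 'foo'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two', 'three', 'four'],
]
index = pd.MultiIndex.from_arrays(arrays)

# Fixed data (0..9) instead of random numbers, so the result is predictable.
s = pd.DataFrame({'value': np.arange(10)}, index=index)

my_subset = {'three', 'one'}

# `in` tries to hash the whole Index, which pandas does not allow.
try:
    s.index.get_level_values(1) in my_subset
    in_failed = False
except TypeError:
    in_failed = True

# Element-wise membership test: one boolean per row of the index.
mask = s.index.get_level_values(1).isin(my_subset)
subset = s.loc[mask]
print(subset)
```

The mask preserves the original row order, so the rows labelled one come before the single three row, as in the output above.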

Performance: this depends on the number of matching values and the number of rows:

N = 10000
a = ['bar', 'baz', 'foo', 'qux']
b = ['one', 'two', 'three','four']
arrays = pd.MultiIndex.from_arrays([np.random.choice(a, size=N),
                                     np.random.choice(b, size=N)], names=['first', 'second'])

s = pd.DataFrame(np.random.randn(N), index=arrays).sort_index()
print(s)

my_subset1 = set(['three', 'one'])
my_subset2 = ['three', 'one']

In [209]: %timeit s.loc[s.index.get_level_values(1).isin(my_subset1)]
866 µs ± 59.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [210]: %timeit s.query('second in @my_subset2')
2.19 ms ± 47.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
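
Outside IPython, a comparable measurement can be sketched with the standard timeit module. This is only an illustration under assumptions: the helper names with_isin and with_query, the seed, the column name value, and the repeat count are made up here, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

N = 10000
rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
index = pd.MultiIndex.from_arrays(
    [rng.choice(['bar', 'baz', 'foo', 'qux'], size=N),
     rng.choice(['one', 'two', 'three', 'four'], size=N)],
    names=['first', 'second'],
)
s = pd.DataFrame({'value': rng.standard_normal(N)}, index=index).sort_index()
my_subset = ['three', 'one']


def with_isin(frame, values):
    # Boolean mask built from the labels of the level named 'second'.
    return frame.loc[frame.index.get_level_values('second').isin(values)]


def with_query(frame, values):
    # Same selection expressed as a query string; @values refers to the argument.
    return frame.query('second in @values')


# Both approaches should select exactly the same rows.
same_result = with_isin(s, my_subset).equals(with_query(s, my_subset))

repeats = 50
for name, func in [('isin', with_isin), ('query', with_query)]:
    seconds = timeit.timeit(lambda: func(s, my_subset), number=repeats) / repeats
    print(f'{name}: {seconds * 1e3:.3f} ms per call')
```

Checking that both selections are equal before timing them guards against comparing two expressions that do not actually return the same rows.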

Answer 2

You can query the DataFrame. Assuming, as in your example, that the second level of the index is named second:

my_subset = ['three', 'one']

res = s.query('second in @my_subset')
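
Note that the DataFrame in the question is built from plain arrays, so its index levels have no names, and the query above relies on a level called second. A minimal sketch of naming the levels first, assuming the names first and second from the question's from_product call and fixed integer data in a hypothetical value column:

```python
import numpy as np
import pandas as pd

arrays = [
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux', 'bar', 'foo'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two', 'three', 'four'],
]
# Built without names, as in the question.
s = pd.DataFrame({'value': np.arange(10)}, index=pd.MultiIndex.from_arrays(arrays))
print(list(s.index.names))

# Give the levels names so that query() can refer to them.
s.index = s.index.set_names(['first', 'second'])

my_subset = ['three', 'one']
res = s.query('second in @my_subset')
print(res)
```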

Pandas MultiIndex subset selection
