I'm using Python 3.6 and pandas 0.19.0, and my DataFrame has a MultiIndex.
import numpy as np
import pandas as pd

iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two', 'three', 'four']]
pd.MultiIndex.from_product(iterables, names=['first', 'second'])
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux', 'bar', 'foo']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two', 'three', 'four'])]
s = pd.DataFrame(np.random.randn(10), index=arrays)
I know how to get a subset based on a single value of one of the index levels, for example:
s.loc[s.index.get_level_values(0)=='bar']
Out[16]:
                  0
bar one    1.409395
    two    0.837486
    three  1.290018
How can I get a subset based on a set of values for a single index level? Obviously, the following syntax does not work:
my_subset = set(['three', 'one'])
s.loc[s.index.get_level_values(1) in my_subset]
Edit:
What is the fastest solution for a large DataFrame?
Answer 0 (score: 1)
Use Index.isin, and select the second level by passing 1 to get_level_values:
my_subset = set(['three', 'one'])
a = s.loc[s.index.get_level_values(1).isin(my_subset)]
print (a)
                  0
bar one   -0.372206
baz one    0.886271
foo one   -2.231380
qux one    0.960636
bar three  1.272873
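For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. It assumes fixed values (np.arange) instead of random ones so the result is reproducible, and it builds the index explicitly with pd.MultiIndex.from_arrays. The level= form of MultiIndex.isin is shown only as an equivalent spelling, not as something this answer used.

```python
import numpy as np
import pandas as pd

# Same index labels as in the question; the values are fixed (not random)
# purely so the result is reproducible. This is an assumption of the sketch.
arrays = [
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux', 'bar', 'foo'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two', 'three', 'four'],
]
s = pd.DataFrame(np.arange(10.0), index=pd.MultiIndex.from_arrays(arrays))

my_subset = ['three', 'one']

# Boolean mask: True where the second-level label is in my_subset.
mask = s.index.get_level_values(1).isin(my_subset)

# Equivalent spelling: ask the MultiIndex itself, restricted to level 1.
mask_by_level = s.index.isin(my_subset, level=1)

# Boolean indexing keeps the rows in their original order.
subset = s.loc[mask]
print(subset)
```

Note that a plain Python `in` cannot be used here because it tests the whole Index object for membership at once; isin does the test element by element and returns a boolean array that .loc accepts.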
Performance: it depends on the number of matching values and on the number of rows:
N = 10000
a = ['bar', 'baz', 'foo', 'qux']
b = ['one', 'two', 'three','four']
arrays = pd.MultiIndex.from_arrays([np.random.choice(a, size=N),
                                    np.random.choice(b, size=N)], names=['first', 'second'])
s = pd.DataFrame(np.random.randn(N), index=arrays).sort_index()
print (s)
my_subset1 = set(['three', 'one'])
my_subset2 = ['three', 'one']
In [209]: %timeit s.loc[s.index.get_level_values(1).isin(my_subset1)]
866 µs ± 59.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [210]: %timeit s.query('second in @my_subset2')
2.19 ms ± 47.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
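If you are not in IPython, a rough equivalent of this comparison can be sketched with the standard timeit module. The helper names with_isin and with_query, the fixed seed, and the repeat count of 100 are all assumptions made for this illustration. Absolute timings will differ by machine and pandas version, so treat the numbers above as indicative only.

```python
import functools
import timeit

import numpy as np
import pandas as pd

N = 10000
rng = np.random.RandomState(0)  # fixed seed: an assumption, for repeatability
idx = pd.MultiIndex.from_arrays(
    [rng.choice(['bar', 'baz', 'foo', 'qux'], size=N),
     rng.choice(['one', 'two', 'three', 'four'], size=N)],
    names=['first', 'second'])
s = pd.DataFrame(rng.randn(N), index=idx).sort_index()

my_subset = ['three', 'one']


def with_isin(df, values):
    """Select rows via a boolean mask on the 'second' level (hypothetical helper)."""
    return df.loc[df.index.get_level_values('second').isin(values)]


def with_query(df, values):
    """Select the same rows via DataFrame.query (hypothetical helper)."""
    return df.query('second in @values')


# Sanity check before timing: both approaches should return the same rows.
same = with_isin(s, my_subset).equals(with_query(s, my_subset))
print('identical results:', same)

for func in (with_isin, with_query):
    call = functools.partial(func, s, my_subset)
    seconds = timeit.timeit(call, number=100) / 100
    print(f'{func.__name__}: {seconds * 1e6:.0f} µs per call')
```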
Answer 1 (score: 1)
You can query the DataFrame. Assuming, as in your example, that the second level of the index is named second:
my_subset = ['three', 'one']
res = s.query('second in @my_subset')
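One caveat: the DataFrame s in the question is built from plain arrays, so its index levels have no names and the query above would not find second. A minimal sketch of naming the levels first follows; the fixed values from np.arange are an assumption made only to keep it reproducible.

```python
import numpy as np
import pandas as pd

arrays = [
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux', 'bar', 'foo'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two', 'three', 'four'],
]
# Fixed values instead of random ones, only to keep the sketch reproducible.
s = pd.DataFrame(np.arange(10.0), index=pd.MultiIndex.from_arrays(arrays))

# The index built this way has no level names yet ...
print(s.index.names)

# ... so name the levels before referring to them inside query().
s.index.names = ['first', 'second']

my_subset = ['three', 'one']
res = s.query('second in @my_subset')
print(res)
```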