Question

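I'm new to Python & pandas and I'm struggling with (hierarchical) indexes. I've got the basics covered, but I'm lost when it comes to more advanced slicing and cross-sections.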

For example, with the following dataframe

import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(9).reshape((3, 3)),
    index=pd.Index(['Ohio', 'Colorado', 'New York'], name='state'), columns=pd.Index(['one', 'two', 'three'], name='number'))

I'd like to select everything except the row with index 'Colorado'. For a small dataset I could do:

data.ix[['Ohio','New York']]

But if the number of unique index values is large, that's impractical. Naively, I would expect a syntax like

data.ix[['state' != 'Colorado']]

However, this only returns the first record 'Ohio' and doesn't return 'New York'. This works, but is cumbersome

filter = list(set(data.index.get_level_values(0).unique()) - set(['Colorado']))
data.loc[filter]

Surely there's a more Pythonic, less verbose way of doing this?

Answer 1

This is a Python issue, not a pandas one: 'state' != 'Colorado' evaluates to True, so what pandas receives is data.ix[[True]].
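
A quick check in plain Python, with no pandas involved, shows what the indexer actually receives. This is only an illustrative sketch; the variable name `key` is made up for the example.

```python
# Both operands are string literals, so Python evaluates the comparison
# immediately; pandas never sees the level name or the operator.
key = ['state' != 'Colorado']

print(key)  # [True]
```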

You could do

>>> data.loc[data.index != "Colorado"]
number    one  two  three
state                    
Ohio        0    1      2
New York    6    7      8

[2 rows x 3 columns]

or use DataFrame.query:

>>> data.query("state != 'New York'")
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

[2 rows x 3 columns]

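Use query if you would rather not repeat data. (Quoting the expression passed to the .query() method is one of the few ways around the fact that Python would otherwise evaluate the comparison before pandas ever sees it.)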
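
If several labels need to be excluded at once, the boolean mask idea should extend naturally. The following is a minimal sketch rather than part of the original answer, assuming `Index.isin` and `DataFrame.drop` behave as documented; the name `excluded` and the labels in it are chosen purely for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=pd.Index(['Ohio', 'Colorado', 'New York'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

excluded = ['Colorado', 'New York']  # hypothetical labels to leave out

# Invert a membership mask built from the index.
by_mask = data.loc[~data.index.isin(excluded)]

# drop() removes the listed index labels and returns a new frame.
by_drop = data.drop(index=excluded)

print(by_mask)
```

Both forms leave `data` itself untouched; which one reads better is mostly a matter of taste.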

Answer 2

This is a robust solution that also works with MultiIndex objects

Single Index

excluded = ['Ohio']
indices = data.index.get_level_values('state').difference(excluded)
indx = pd.IndexSlice[indices.values]

Output

In [77]: data.loc[indx]
Out[77]:
number    one  two  three
state
Colorado    3    4      5
New York    6    7      8
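
One detail that may matter here: `Index.difference` appears to return its result sorted where possible, so rows selected this way can come back in a different order than in the original frame. A small sketch of that, assuming the default `sort` behaviour and using the boolean mask approach for comparison:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=pd.Index(['Ohio', 'Colorado', 'New York'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

excluded = ['Colorado']

# difference() yields the remaining labels in sorted order.
indices = data.index.get_level_values('state').difference(excluded)
via_difference = data.loc[pd.IndexSlice[indices.values]]

# A boolean mask keeps the rows in their original order.
via_mask = data.loc[data.index != 'Colorado']

print(list(via_difference.index))
print(list(via_mask.index))
```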

MultiIndex Extension

Here I extend this to a MultiIndex example...

data = pd.DataFrame(
    np.arange(18).reshape(6, 3),
    index=pd.MultiIndex(
        levels=[['AU', 'UK'], ['Derby', 'Kensington', 'Newcastle', 'Sydney']],
        # `codes` was called `labels` before pandas 0.24
        codes=[[0, 0, 0, 1, 1, 1], [0, 2, 3, 0, 1, 2]],
        names=['country', 'town']),
    columns=pd.Index(['one', 'two', 'three'], name='number'))

Suppose we want to exclude 'Newcastle' from both countries in this new MultiIndex

excluded = ['Newcastle']
indices = data.index.get_level_values('town').difference(excluded)
indx = pd.IndexSlice[:, indices.values]

Which gives the expected result

In [115]: data.loc[indx, :]
Out[115]:
number              one  two  three
country town
AU      Derby         0    1      2
        Sydney        6    7      8
UK      Derby         9   10     11
        Kensington   12   13     14
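
The boolean mask from Answer 1 can also be applied to a single level of a MultiIndex. This is a sketch of an alternative, not this answer's own method; it rebuilds the same frame with `MultiIndex.from_tuples` so that it can be run on its own.

```python
import numpy as np
import pandas as pd

# Same rows as the MultiIndex example, built from explicit tuples.
index = pd.MultiIndex.from_tuples(
    [('AU', 'Derby'), ('AU', 'Newcastle'), ('AU', 'Sydney'),
     ('UK', 'Derby'), ('UK', 'Kensington'), ('UK', 'Newcastle')],
    names=['country', 'town'],
)
data = pd.DataFrame(
    np.arange(18).reshape(6, 3),
    index=index,
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

# Compare one level of the MultiIndex against the label to exclude.
mask = data.index.get_level_values('town') != 'Newcastle'
result = data.loc[mask]

print(result)
```

A mask like this does not depend on the index being sorted, which may make it the simpler choice when only one label has to go.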

Common Pitfalls

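Make sure all levels of the index are sorted; you may need data.sort_index(inplace=True)
Make sure you include the null slice for the columns: data.loc[indx, :]
Sometimes indx = pd.IndexSlice[:, indices] is good enough, but I found that I often need to use indx = pd.IndexSlice[:, indices.values]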
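
The first two pitfalls can be combined into one short sequence. This is a sketch under stated assumptions: the frame below is hypothetical, with its MultiIndex deliberately out of order, and it uses the non-inplace form of `sort_index`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose MultiIndex is deliberately out of order.
index = pd.MultiIndex.from_tuples(
    [('UK', 'Newcastle'), ('AU', 'Sydney'), ('UK', 'Derby'), ('AU', 'Derby')],
    names=['country', 'town'],
)
data = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=index,
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

print(data.index.is_monotonic_increasing)  # False

# Sort every level first, then build the slicer and select with the
# explicit column slice.
data = data.sort_index()
indices = data.index.get_level_values('town').difference(['Newcastle'])
result = data.loc[pd.IndexSlice[:, indices.values], :]

print(result)
```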

Most efficient way to exclude indexed rows in a pandas dataframe
