Question

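I am following this tutorial: GitHub Link

If you scroll down (Ctrl + F: Exercise: Select the most-reviewd beers) to the section titled Exercise: Select the most-reviewd beers:

The dataframe is multi-indexed:

To select the most-reviewed beers: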

top_beers = df['beer_id'].value_counts().head(10).index
reviews.loc[pd.IndexSlice[:, top_beers], ['beer_name', 'beer_style']]

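My question is about how IndexSlice is used here: how come you can skip the colon after top_beers and the code still runs?

reviews.loc[pd.IndexSlice[:, top_beers, :], ['beer_name', 'beer_style']]

There are three index levels: profile_name, beer_id and time. Why does pd.IndexSlice[:, top_beers] work (without specifying what to do with the time level)?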

Answer 1

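To complement the previous answer, let me explain how pd.IndexSlice works and why it is useful.

There is not much to say about its implementation. As you can read in the source, it just does the following:

class IndexSlice(object):
    def __getitem__(self, arg):
        return arg

From this we see that pd.IndexSlice only forwards the arguments that __getitem__ has received. Looks pretty silly, doesn't it? However, it actually does something.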
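A minimal sketch of the same idea, without pandas. The class name PassThrough and the variable names are hypothetical, chosen only for illustration; the point is that a __getitem__ which returns its argument hands back whatever Python built from the bracket expression.

```python
# Hypothetical stand-in for pd.IndexSlice: it returns its argument unchanged.
class PassThrough:
    def __getitem__(self, arg):
        return arg


pt = PassThrough()

# A single slice expression arrives as one slice object.
single = pt[0:3]

# Several comma-separated items arrive as one tuple.
pair = pt[:, ["a", "b"]]

print(single)
print(pair)
```

Nothing in the class knows about slices; the conversion has already happened by the time __getitem__ is called.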

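As you certainly know already, obj.__getitem__(arg) is called when you access an object obj through its bracket operator obj[arg]. For sequence-type objects, arg can be either an integer or a slice object. We rarely construct slices ourselves. Rather, we use slice notation with : for this purpose, e.g. obj[0:5].

And here comes the point. The Python interpreter converts the slice notation : into slice objects before calling the object's __getitem__(arg) method. Therefore, the return value of IndexSlice.__getitem__() will actually be a slice, an integer (if no : was used), or a tuple of these (if multiple arguments are passed). In summary, the only purpose of IndexSlice is that we don't have to construct the slices on our own. This behavior is particularly useful for pd.DataFrame.loc.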
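To make the conversion concrete, here is a small sketch on a plain Python list (the list letters is just an illustrative example). Slice notation, an explicit slice object, and a direct __getitem__ call should all select the same elements.

```python
letters = list("abcdefgh")

# Slice notation: the interpreter builds slice(0, 5, None) for us.
by_notation = letters[0:5]

# The same selection with a hand-built slice object.
by_object = letters[slice(0, 5)]

# The same selection through an explicit __getitem__ call.
by_dunder = letters.__getitem__(slice(0, 5))

# A bare ':' corresponds to slice(None); '::2' to slice(None, None, 2).
every_other = letters[slice(None, None, 2)]

print(by_notation, by_object, by_dunder, every_other)
```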

Let's first have a look at the following examples:

import pandas as pd
idx = pd.IndexSlice
print(idx[0])             # 0
print(idx[0,'a'])         # (0, 'a')
print(idx[:])             # slice(None, None, None)
print(idx[0:3])           # slice(0, 3, None)
print(idx[0:3,'a':'c'])   # (slice(0, 3, None), slice('a', 'c', None))

So all colons : are converted into the corresponding slice objects. If multiple arguments are passed to the index operator, they are returned as an n-tuple.

To demonstrate how this can be useful for a pandas data frame df with a multi-level index, let's have a look at the following.

# Let's first construct a table with a three-level
# row-index, and single-level column index.
import numpy as np
import pandas as pd
idx = pd.IndexSlice
level0 = range(0,10)
level1 = list('abcdef')
level2 = ['I', 'II', 'III', 'IV']
mi = pd.MultiIndex.from_product([level0, level1, level2])
df = pd.DataFrame(np.random.random([len(mi),2]),
                  index=mi, columns=['col1', 'col2'])

# Return 'col1', select all rows.
df.loc[:,'col1']            # pd.Series

# Note: in the above example, the returned value has type
# pd.Series, because only one column is returned. One can
# enforce the returned object to be a data-frame:
df.loc[:,['col1']]          # pd.DataFrame, or
df.loc[:,'col1'].to_frame() #

# Select all rows with top-level values 0:3.
df.loc[0:3, 'col1']

# If we want to create a slice for multiple index levels
# we need to pass somehow a list of slices. The following
# however leads to a SyntaxError because the slice
# operator ':' cannot be placed inside a list declaration
# (left commented out so the rest of the snippet runs):
# df.loc[[0:3, 'a':'c'], 'col1']

# The following is valid python code, but looks clumsy:
df.loc[(slice(0, 3, None), slice('a', 'c', None)), 'col1']

# Here is why pd.IndexSlice is useful. It helps
# to create a slice that makes use of two index-levels.
df.loc[idx[0:3, 'a':'c'], 'col1']

# We can also expand the slice specification by third level.
df.loc[idx[0:3, 'a':'c', 'I':'III'], 'col1']

# A solitary slicing operator ':' means: take them all.
# It is equivalent to slice(None).
df.loc[idx[0:3, 'a':'c', :], 'col1'] # pd.Series

# Semantically, this is equivalent to the following,
# because the last ':' in the previous example does
# not add any information about the slice specification.
df.loc[idx[0:3, 'a':'c'], 'col1']    # pd.Series

# The following lines are also equivalent, but
# both expressions evaluate to a result with multiple columns.
df.loc[idx[0:3, 'a':'c', :], :]    # pd.DataFrame
df.loc[idx[0:3, 'a':'c'], :]       # pd.DataFrame

So in summary, pd.IndexSlice helps to improve readability when specifying slices for row and column indices.

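What pandas then does with these slice specifications is a different story. Essentially, it selects rows/columns starting from the topmost index level and narrows the selection when going further down the levels, depending on how many levels have been specified. pd.DataFrame.loc is an object with its own __getitem__() function that does all this.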
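The narrowing described above can be observed by counting rows as more levels are specified. This is only a sketch on a small made-up frame (3 x 3 x 3 index values, one column named col1), not a description of the pandas internals.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Hypothetical three-level index with 3 * 3 * 3 = 27 rows.
mi = pd.MultiIndex.from_product([range(3), list("abc"), ["I", "II", "III"]])
df = pd.DataFrame({"col1": np.arange(len(mi))}, index=mi)

# No restriction at all: every row.
n_all = len(df.loc[idx[:], :])

# Restrict level 0 only (label slices include both end points).
n_level0 = len(df.loc[idx[0:1], :])

# Restrict levels 0 and 1.
n_level01 = len(df.loc[idx[0:1, "a":"b"], :])

# Restrict all three levels.
n_level012 = len(df.loc[idx[0:1, "a":"b", "I":"II"], :])

print(n_all, n_level0, n_level01, n_level012)
```

Each additional level in the slice specification can only keep or reduce the rows selected by the levels before it.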

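As you already pointed out in one of your comments, pandas seemingly behaves oddly in some special cases. The two examples you mentioned will actually evaluate to the same result. However, they are treated differently by pandas internally.

Admittedly, the difference is subtle.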

Answer 2

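Pandas only requires you to specify enough levels of the MultiIndex to remove ambiguity. Since you are slicing on the second level, you need the first : to say that you are not filtering on the first level.

Any additional levels that are not specified are returned in their entirety, which is equivalent to a : on each of those levels.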
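A small sketch that mirrors the question, under stated assumptions: the frame reviews, its score column, and the values in top_beers are made up for illustration, and top_beers is a plain list rather than the Index returned by value_counts(). With and without the trailing :, the selection should be identical.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Hypothetical stand-in for the tutorial's reviews frame.
mi = pd.MultiIndex.from_product(
    [["alice", "bob"], [10, 20, 30], ["t1", "t2"]],
    names=["profile_name", "beer_id", "time"],
)
reviews = pd.DataFrame({"score": np.arange(len(mi))}, index=mi)

# Illustrative selection on the second level.
top_beers = [10, 30]

# The trailing level is left unspecified ...
short_form = reviews.loc[idx[:, top_beers], ["score"]]

# ... or spelled out explicitly with ':'.
long_form = reviews.loc[idx[:, top_beers, :], ["score"]]

print(short_form.equals(long_form))
```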
