Question

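I have been working with Pandas MultiIndex DataFrames for a few weeks, and I feel I don't really understand the intuition behind GroupBy objects, especially when it comes to selecting groups.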

Let's take this code as an example:

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

s = pd.Series(np.random.randn(8), index=index)
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

df.groupby(level=0).first()

The output of the last line is:

            0         1         2         3
bar  1.612350 -0.019424 -0.088925 -0.188864
baz  2.752485 -1.011006  0.249788  1.106547
foo  1.313016  0.716512  0.550851 -1.532394
qux  1.505173  0.758075  1.360808  1.261204

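However, this behavior does not make much sense to me, because it gives me the first entry of each group, as if I had grouped by the second level. What I expected from the code above is: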

            0         1         2         3
one  1.612350 -0.019424 -0.088925 -0.188864
two  0.434829  1.698503 -0.213425  0.329733

Until now, I have gotten what I wanted by doing this:

list(df.groupby(level=0))[0][1]

However, this does not seem to be the intended way to do it.

Somehow it seems I have wrong expectations of the GroupBy object. Maybe someone can help me clear up my confusion :).

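Additional information: I am not looking for a specific solution for getting the "first group", since I already get it by creating a list from the object. My question is about understanding the GroupBy object and why it selects the first (or any other) group the way it does.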

Answer 1

Are you looking for MultiIndex slicing?

df.loc[pd.IndexSlice['bar',:],:]
Out[319]: 
                0        1         2         3
bar one  0.807706  0.07296  0.638787  0.329646
    two -0.497104 -0.75407 -0.943406  0.484752
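
As a hedged side-by-side, here is a minimal sketch of the slice above next to DataFrame.xs, which selects the same rows but drops the selected level by default. The small integer frame is an illustrative assumption, not the asker's random data.

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumed data): two first-level keys, fixed integer values.
arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=arrays)

# Slice every row whose first level is 'bar'; both index levels are kept.
bar_slice = df.loc[pd.IndexSlice['bar', :], :]

# xs() picks the same rows but drops the selected level by default.
bar_xs = df.xs('bar', level=0)

print(bar_slice)
print(bar_xs)
```

Which form is preferable depends on whether the 'bar' label should remain in the index of the result.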

Answer 2

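Your first index level is level_0, but you want to group by level_1. If you reset the index, both levels become regular columns with their own headers, and you can then group by either of them.

Add the following code: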

df=df.reset_index()

df=df.groupby(['level_1']).first()
df.head()
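
If grouping by the second level is the goal, groupby can also take level=1 directly, without resetting the index. A minimal sketch, assuming a small integer frame in place of the original random data:

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumed data) with an unnamed two-level index.
arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=arrays)

# Group by the second index level directly; the index is left untouched.
by_second = df.groupby(level=1).first()

# The reset_index() route: the levels become columns 'level_0' and 'level_1'.
# 'level_0' is dropped afterwards so only the data columns remain.
via_reset = (df.reset_index()
               .groupby('level_1')
               .first()
               .drop(columns='level_0'))

print(by_second)
print(via_reset)
```

Both routes return one row per second-level label ('one', 'two'), holding the first entry found for that label.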

Answer 3

You can give the MultiIndex levels names and then use pd.DataFrame.query:

df.index.names = ['first', 'second']
res = df.query('first == "bar"')

print(res)

                     0         1         2         3
first second                                        
bar   one     0.555863 -0.080074 -1.726498 -0.874648
      two     1.099309  0.047887  0.294042  0.222972

Alternatively, use pd.Index.get_level_values:

res = df[df.index.get_level_values(0) == 'bar']
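
To check that the two approaches agree, here is a small sketch on fixed data. The string column labels 'a' to 'd' and the integer values are assumptions made for the illustration; get_level_values is called with the level name rather than its position.

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumed data) with named index levels.
index = pd.MultiIndex.from_arrays(
    [['bar', 'bar', 'baz', 'baz'], ['one', 'two', 'one', 'two']],
    names=['first', 'second'],
)
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=index, columns=list('abcd'))

# query() can refer to a named index level as if it were a column.
by_query = df.query('first == "bar"')

# Boolean mask built from the values of the same level.
by_mask = df[df.index.get_level_values('first') == 'bar']

print(by_query.equals(by_mask))
```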

Answer 4

Since @user2285236 answered my question in the comments, I'll try to summarize it here.

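The method first() does not select the first group; it selects the first entry of each group. The reason there is no built-in equivalent of list(df.groupby(level=0))[0][1] is that groupby() sorts the groups by key by default, so the "first" group is not necessarily the one that appears first in the data.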
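
The difference between "first entry of each group" and "first group" can be made concrete with a short sketch. The single column 'x' and its values are assumptions chosen so the result is easy to read.

```python
import pandas as pd

# Illustrative frame (assumed data): one column with known values.
arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]
df = pd.DataFrame({'x': [1, 2, 3, 4]}, index=arrays)

grouped = df.groupby(level=0)

# first() reduces every group to its first entry: one row per group key.
firsts = grouped.first()

# Iterating a GroupBy yields (key, sub-frame) pairs: the groups themselves.
groups = {key: frame for key, frame in grouped}

print(firsts)
print(groups['bar'])
```

Here firsts has one row for 'bar' and one for 'baz', while groups['bar'] is the full two-row sub-frame.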

For example, let's rearrange the example above so that 'qux' is the first 'group'. It looks like this:

arrays = [['qux', 'qux', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

s = pd.Series(np.random.randn(8), index=index)
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

Calling list(df.groupby(level=0))[0][1] returns:

                0         1         2         3
bar one -0.335708 -0.315253 -0.087970  0.754242
    two -1.608651  1.005786  1.800341 -1.059510

rather than the first 'group', which is what I had hoped for:

                0         1         2         3
qux one -0.374186  0.812865  0.578298 -0.901881
    two -0.137799  0.278797 -1.171522  0.319980

However, each group can be retrieved with the built-in method get_group(). So in this case I can get the first 'group' by calling: df.groupby(level=0).get_group('qux')
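
As a rough sketch of the sorting behavior described above: groupby accepts a sort parameter (True by default), and passing sort=False should keep the groups in order of first appearance. The small integer frame is an assumption for the illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumed data) in which 'qux' comes before 'bar'.
arrays = [['qux', 'qux', 'bar', 'bar'],
          ['one', 'two', 'one', 'two']]
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=arrays)

# Default (sort=True): group keys are iterated in sorted order.
sorted_keys = [key for key, _ in df.groupby(level=0)]

# sort=False: group keys are iterated in order of first appearance.
unsorted_keys = [key for key, _ in df.groupby(level=0, sort=False)]

# get_group() retrieves a group by key, independent of iteration order.
qux = df.groupby(level=0).get_group('qux')

print(sorted_keys, unsorted_keys)
print(qux)
```

Looking a group up by key with get_group() avoids relying on iteration order at all.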

Pandas MultiIndex and GroupBy return strange behavior
