Question

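See the example below: how can I return data from a groupby on all 3 levels of the original MultiIndex?

In this example, I want to see the totals by brand. For now I have applied a workaround using map (see below; it shows the output I would like to get directly from the groupby).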

import numpy as np
import pandas as pd

brands = ['Tesla','Tesla','Tesla','Peugeot', 'Peugeot', 'Citroen', 'Opel', 'Opel', 'Peugeot', 'Citroen', 'Opel']
years = [2018, 2017,2016, 2018, 2017, 2017, 2018, 2017,2016, 2016,2016]
owners = ['Tesla','Tesla','Tesla','PSA', 'PSA', 'PSA', 'PSA', 'PSA','PSA', 'PSA', 'PSA']
index = pd.MultiIndex.from_arrays([owners, years, brands], names=['owner', 'year', 'brand'])
data = np.random.randint(low=100, high=1000, size=len(index), dtype=int)
weight = np.random.randint(low=1, high=10, size=len(index), dtype=int)
df = pd.DataFrame({'data': data, 'weight': weight},index=index)
df.loc[('PSA', 2017, 'Opel'), 'data'] = np.nan
df.loc[('PSA', 2016, 'Opel'), 'data'] = np.nan
df.loc[('PSA', 2016, 'Citroen'), 'data'] = np.nan
df.loc[('Tesla', 2016, 'Tesla'), 'data'] = np.nan

Out:

                        data    weight
owner   year    brand       
PSA     2016    Citroen NaN     5
                Opel    NaN     5
                Peugeot 250.0   2
        2017    Citroen 469.0   4
                Opel    NaN     5
                Peugeot 768.0   5
        2018    Opel    237.0   6
                Peugeot 663.0   4
Tesla   2016    Tesla   NaN     3
        2017    Tesla   695.0   6
        2018    Tesla   371.0   5

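I tried using the index with "level" as well as the columns with "by". I tried "as_index=False" with .sum(), and also "group_keys=False" with .apply(sum). But I cannot get the brand level back in the groupby output: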

grouped = df.groupby(level=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.apply(sum)

Out:

        data    weight  group_data
owner   year            
PSA     2016    250.0   12.0    750.0
        2017    1237.0  14.0    3711.0
        2018    900.0   10.0    1800.0
Tesla   2016    0.0     3.0     0.0
        2017    695.0   6.0     695.0
        2018    371.0   5.0     371.0

Similarly:

grouped = df.groupby(by=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.apply(sum)

Or:

grouped = df.groupby(by=['owner', 'year'], as_index=False, group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.sum()

Workaround:

grouped = df.groupby(level=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
df_owner_year = grouped.apply(sum)
s_data = df_owner_year['data']
df['group_data'] = df.index.map(s_data)
df

Out:

                        data    weight  group_data
owner   year    brand           
PSA     2016    Citroen NaN     5   250.0
                Opel    NaN     5   250.0
                Peugeot 250.0   2   250.0
        2017    Citroen 469.0   4   1237.0
                Opel    NaN     5   1237.0
                Peugeot 768.0   5   1237.0
        2018    Opel    237.0   6   900.0
                Peugeot 663.0   4   900.0
Tesla   2016    Tesla   NaN     3   0.0
        2017    Tesla   695.0   6   695.0
        2018    Tesla   371.0   5   371.0

Answer 1

You can do this with groupby.

df = df.sort_index()
print(df)

                     data  weight
owner year brand                 
PSA   2016 Citroen    NaN       4
           Opel       NaN       7
           Peugeot  880.0       1
      2017 Citroen  164.0       2
           Opel       NaN       5
           Peugeot  607.0       8
      2018 Opel     809.0       1
           Peugeot  317.0       8
Tesla 2016 Tesla      NaN       1
      2017 Tesla    384.0       9
      2018 Tesla    550.0       9

Group by owner and year, then set the new column equal to the grouped sum.

df['new'] = df.groupby(['owner', 'year'])['data'].sum()
print(df)

                    data  weight     new
owner year brand                         
PSA   2016 Citroen    NaN       4   880.0
           Opel       NaN       7   880.0
           Peugeot  880.0       1   880.0
      2017 Citroen  164.0       2   771.0
           Opel       NaN       5   771.0
           Peugeot  607.0       8   771.0
      2018 Opel     809.0       1  1126.0
           Peugeot  317.0       8  1126.0
Tesla 2016 Tesla      NaN       1     0.0
      2017 Tesla    384.0       9   384.0
      2018 Tesla    550.0       9   550.0
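As an alternative sketch (not part of the answer above), `transform('sum')` returns a result indexed like the original frame, so the assignment does not rely on aligning a two-level result against a three-level index. The small frame and the column name `group_data` below are illustrative assumptions, not the data from the question.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with the same three index levels as the question.
index = pd.MultiIndex.from_tuples(
    [
        ("PSA", 2016, "Citroen"),
        ("PSA", 2016, "Peugeot"),
        ("PSA", 2017, "Peugeot"),
        ("Tesla", 2016, "Tesla"),
    ],
    names=["owner", "year", "brand"],
)
df = pd.DataFrame({"data": [np.nan, 250.0, 768.0, np.nan]}, index=index)

grouped = df.groupby(level=["owner", "year"])["data"]

# An aggregation collapses each group to one row, so only the
# grouping levels (owner, year) remain in the result's index.
totals = grouped.sum()

# transform broadcasts each group's total back to every original row,
# keeping all three index levels.
df["group_data"] = grouped.transform("sum")
print(df)
```

Here `totals` has only the two grouping levels, while `df` keeps owner, year and brand, with the group total repeated on each brand row.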

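Edit

A follow-up question came up: why does the assignment to df['new'] return NaN when grouping by columns, but return the correct values when the grouping keys are in the index? I asked this on SO, and there is a good answer by @Jezrael here.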

Answer 2

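I'm sure MultiIndex is useful in some situations, but I usually just want to get rid of it as soon as possible, so I would start with:

df = df.reset_index()

After that you can easily group by a single column such as brand, or by owner and year.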
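The two groupings mentioned above could look roughly like the following sketch. The frame, its values, and the names `flat`, `by_brand` and `group_data` are assumptions made for illustration; `transform('sum')` is used for the owner/year case so that every original row keeps its brand.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this example.
index = pd.MultiIndex.from_tuples(
    [
        ("PSA", 2016, "Peugeot"),
        ("PSA", 2017, "Citroen"),
        ("PSA", 2017, "Peugeot"),
        ("Tesla", 2017, "Tesla"),
    ],
    names=["owner", "year", "brand"],
)
df = pd.DataFrame({"data": [250.0, 469.0, 768.0, 695.0]}, index=index)

# Move all index levels into ordinary columns.
flat = df.reset_index()

# Totals per brand: one row per brand.
by_brand = flat.groupby("brand")["data"].sum()

# Totals per owner and year, broadcast back onto every row.
flat["group_data"] = flat.groupby(["owner", "year"])["data"].transform("sum")
print(flat)
```

If the MultiIndex is needed again afterwards, `flat.set_index(['owner', 'year', 'brand'])` restores it.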

pandas groupby return original MultiIndex
