pandas groupby: return the original MultiIndex

Time: 2019-05-12 12:15:17

Tags: python-3.x pandas

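See the example below: how can I get data back from a groupby on all 3 levels of the original MultiIndex?

In this example, I want to see the totals by brand. For now I have applied a workaround using map (see below; it shows the output I would like to get directly from groupby).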

import numpy as np
import pandas as pd

brands = ['Tesla', 'Tesla', 'Tesla', 'Peugeot', 'Peugeot', 'Citroen', 'Opel', 'Opel', 'Peugeot', 'Citroen', 'Opel']
years = [2018, 2017, 2016, 2018, 2017, 2017, 2018, 2017, 2016, 2016, 2016]
owners = ['Tesla','Tesla','Tesla','PSA', 'PSA', 'PSA', 'PSA', 'PSA','PSA', 'PSA', 'PSA']
index = pd.MultiIndex.from_arrays([owners, years, brands], names=['owner', 'year', 'brand'])
data = np.random.randint(low=100, high=1000, size=len(index), dtype=int)
weight = np.random.randint(low=1, high=10, size=len(index), dtype=int)
df = pd.DataFrame({'data': data, 'weight': weight},index=index)
df.loc[('PSA', 2017, 'Opel'), 'data'] = np.nan
df.loc[('PSA', 2016, 'Opel'), 'data'] = np.nan
df.loc[('PSA', 2016, 'Citroen'), 'data'] = np.nan
df.loc[('Tesla', 2016, 'Tesla'), 'data'] = np.nan

Out:

                        data    weight
owner   year    brand       
PSA     2016    Citroen NaN     5
                Opel    NaN     5
                Peugeot 250.0   2
        2017    Citroen 469.0   4
                Opel    NaN     5
                Peugeot 768.0   5
        2018    Opel    237.0   6
                Peugeot 663.0   4
Tesla   2016    Tesla   NaN     3
        2017    Tesla   695.0   6
        2018    Tesla   371.0   5

I tried using the index with "level" as well as columns with "by". I tried "as_index=False" with .sum(), and "group_keys=False" with .apply(sum). But I cannot get the brand column back in the groupby output:

grouped = df.groupby(level=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.apply(sum)

Out:

        data    weight  group_data
owner   year            
PSA     2016    250.0   12.0    750.0
        2017    1237.0  14.0    3711.0
        2018    900.0   10.0    1800.0
Tesla   2016    0.0     3.0     0.0
        2017    695.0   6.0     695.0
        2018    371.0   5.0     371.0

Similarly:

grouped = df.groupby(by=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.apply(sum)

Or:

grouped = df.groupby(by=['owner', 'year'], as_index=False, group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.sum()
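
The attempts above all aggregate, and an aggregation returns one row per (owner, year) group, so the brand level has nowhere to live in the result. A minimal sketch on a small hand-made frame (the four rows below are illustrative values, not the random data from the question):

```python
import numpy as np
import pandas as pd

# Small illustrative frame with the same three index levels as in the question.
index = pd.MultiIndex.from_tuples(
    [
        ("PSA", 2016, "Citroen"),
        ("PSA", 2016, "Peugeot"),
        ("PSA", 2017, "Opel"),
        ("Tesla", 2016, "Tesla"),
    ],
    names=["owner", "year", "brand"],
)
df = pd.DataFrame({"data": [np.nan, 250.0, 469.0, np.nan]}, index=index)

# Aggregating collapses each (owner, year) group to a single row.
totals = df.groupby(level=["owner", "year"]).sum()

print(list(df.index.names))      # three levels in the input
print(list(totals.index.names))  # only the grouping levels remain
print(len(df), len(totals))      # fewer rows after aggregation
```
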

Workaround:

grouped = df.groupby(level=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
df_owner_year = grouped.apply(sum)
s_data = df_owner_year['data']
df['group_data'] = df.index.map(s_data)
df

Out:

                        data    weight  group_data
owner   year    brand           
PSA     2016    Citroen NaN     5   250.0
                Opel    NaN     5   250.0
                Peugeot 250.0   2   250.0
        2017    Citroen 469.0   4   1237.0
                Opel    NaN     5   1237.0
                Peugeot 768.0   5   1237.0
        2018    Opel    237.0   6   900.0
                Peugeot 663.0   4   900.0
Tesla   2016    Tesla   NaN     3   0.0
        2017    Tesla   695.0   6   695.0
        2018    Tesla   371.0   5   371.0

2 answers:

Answer 0 (score: 2)

You can do this with groupby.

df = df.sort_index()
print(df)

                     data  weight
owner year brand                 
PSA   2016 Citroen    NaN       4
           Opel       NaN       7
           Peugeot  880.0       1
      2017 Citroen  164.0       2
           Opel       NaN       5
           Peugeot  607.0       8
      2018 Opel     809.0       1
           Peugeot  317.0       8
Tesla 2016 Tesla      NaN       1
      2017 Tesla    384.0       9
      2018 Tesla    550.0       9

Group by owner and year, then set the new column equal to the result.

df['new'] = df.groupby(['owner', 'year'])['data'].sum()
print(df)

                    data  weight     new
owner year brand                         
PSA   2016 Citroen    NaN       4   880.0
           Opel       NaN       7   880.0
           Peugeot  880.0       1   880.0
      2017 Citroen  164.0       2   771.0
           Opel       NaN       5   771.0
           Peugeot  607.0       8   771.0
      2018 Opel     809.0       1  1126.0
           Peugeot  317.0       8  1126.0
Tesla 2016 Tesla      NaN       1     0.0
      2017 Tesla    384.0       9   384.0
      2018 Tesla    550.0       9   550.0
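
The assignment above relies on pandas aligning a two-level result against the three-level index of df. An alternative that avoids that alignment step is transform, which returns a result with the same index as the input. This is a swapped-in technique, not the method used in this answer, and the sample rows and the column name group_data are illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative frame with the same index levels as in the question.
index = pd.MultiIndex.from_tuples(
    [
        ("PSA", 2016, "Citroen"),
        ("PSA", 2016, "Peugeot"),
        ("PSA", 2017, "Opel"),
        ("Tesla", 2016, "Tesla"),
    ],
    names=["owner", "year", "brand"],
)
df = pd.DataFrame({"data": [np.nan, 250.0, 469.0, np.nan]}, index=index)

# transform("sum") computes the sum per (owner, year) group and broadcasts it
# back to every row of the group, so the brand level is preserved.
df["group_data"] = df.groupby(level=["owner", "year"])["data"].transform("sum")

print(df)
```

As in the outputs shown earlier, NaN values are skipped by the sum, and a group that contains only NaN gets a total of 0.0.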

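Edit

A follow-up question came up: why does df['new'] come back as NaN when the grouping keys are columns, while the correct values come back when the grouping keys are in the index? I asked this as a separate question on SO, and @Jezrael gave a great answer here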
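
One plausible reading of the NaN case (an assumption about the cause, not a summary of the linked answer) is that, when owner and year are ordinary columns, the aggregated result is indexed by (owner, year) while the frame keeps its default row index, so the two do not line up. A sketch that sidesteps this by merging the totals back on the key columns explicitly; the frame name flat, the column name new, and the sample rows are illustrative:

```python
import numpy as np
import pandas as pd

# Same kind of data, but with owner/year/brand as ordinary columns.
flat = pd.DataFrame(
    {
        "owner": ["PSA", "PSA", "PSA", "Tesla"],
        "year": [2016, 2016, 2017, 2016],
        "brand": ["Citroen", "Peugeot", "Opel", "Tesla"],
        "data": [np.nan, 250.0, 469.0, np.nan],
    }
)

# The aggregated Series is indexed by (owner, year), not by row position.
totals = flat.groupby(["owner", "year"])["data"].sum()
print(list(totals.index.names), flat.index.tolist())

# Turn the totals into a regular table and join on the key columns.
totals_table = totals.rename("new").reset_index()
merged = flat.merge(totals_table, on=["owner", "year"], how="left")

print(merged)
```
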

Answer 1 (score: 1)

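I am sure there are cases where a MultiIndex is useful, but I usually just want to get rid of it as soon as possible, so I would start with:

df = df.reset_index()

Then you can easily group by brand, or by owner and year.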
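
A minimal sketch of grouping after reset_index(), once by brand and once by owner and year. The exact groupby calls from this answer are not preserved, so the calls below are an assumption about what was meant, and the sample rows and variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative frame with the same index levels as in the question.
index = pd.MultiIndex.from_tuples(
    [
        ("PSA", 2016, "Peugeot"),
        ("PSA", 2017, "Peugeot"),
        ("PSA", 2017, "Citroen"),
        ("Tesla", 2017, "Tesla"),
        ("PSA", 2016, "Opel"),
    ],
    names=["owner", "year", "brand"],
)
df = pd.DataFrame({"data": [250.0, 768.0, 469.0, 695.0, np.nan]}, index=index)

# Move the index levels into ordinary columns.
flat = df.reset_index()

# Totals per brand, and totals per (owner, year).
by_brand = flat.groupby("brand")["data"].sum()
by_owner_year = flat.groupby(["owner", "year"])["data"].sum()

print(by_brand)
print(by_owner_year)
```
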