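See the example below: how can I return data from a groupby on all 3 levels of the original MultiIndex?
In this example, I want to see the totals by brand. For now I have applied a workaround using map (see below; it shows the output I was hoping to get directly from the groupby).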
import numpy as np
import pandas as pd

brands = ['Tesla','Tesla','Tesla','Peugeot', 'Peugeot', 'Citroen', 'Opel', 'Opel', 'Peugeot', 'Citroen', 'Opel']
years = [2018, 2017,2016, 2018, 2017, 2017, 2018, 2017,2016, 2016,2016]
owners = ['Tesla','Tesla','Tesla','PSA', 'PSA', 'PSA', 'PSA', 'PSA','PSA', 'PSA', 'PSA']
index = pd.MultiIndex.from_arrays([owners, years, brands], names=['owner', 'year', 'brand'])
data = np.random.randint(low=100, high=1000, size=len(index), dtype=int)
weight = np.random.randint(low=1, high=10, size=len(index), dtype=int)
df = pd.DataFrame({'data': data, 'weight': weight},index=index)
df.loc[('PSA', 2017, 'Opel'), 'data'] = np.nan
df.loc[('PSA', 2016, 'Opel'), 'data'] = np.nan
df.loc[('PSA', 2016, 'Citroen'), 'data'] = np.nan
df.loc[('Tesla', 2016, 'Tesla'), 'data'] = np.nan
Out:
                     data  weight
owner year brand
PSA   2016 Citroen    NaN       5
           Opel       NaN       5
           Peugeot  250.0       2
      2017 Citroen  469.0       4
           Opel       NaN       5
           Peugeot  768.0       5
      2018 Opel     237.0       6
           Peugeot  663.0       4
Tesla 2016 Tesla      NaN       3
      2017 Tesla    695.0       6
      2018 Tesla    371.0       5
I tried using the index with level, and the columns with by. I tried as_index=False with .sum(), and group_keys=False with .apply(sum). But I cannot get the brand column back in the groupby output:
grouped = df.groupby(level=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.apply(sum)
Out:
              data  weight  group_data
owner year
PSA   2016   250.0    12.0       750.0
      2017  1237.0    14.0      3711.0
      2018   900.0    10.0      1800.0
Tesla 2016     0.0     3.0         0.0
      2017   695.0     6.0       695.0
      2018   371.0     5.0       371.0
Similarly:
grouped = df.groupby(by=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.apply(sum)
Or:
grouped = df.groupby(by=['owner', 'year'], as_index=False, group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.sum()
Workaround:
grouped = df.groupby(level=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
df_owner_year = grouped.apply(sum)
s_data = df_owner_year['data']
df['group_data'] = df.index.map(s_data)
df
Out:
                     data  weight  group_data
owner year brand
PSA   2016 Citroen    NaN       5       250.0
           Opel       NaN       5       250.0
           Peugeot  250.0       2       250.0
      2017 Citroen  469.0       4      1237.0
           Opel       NaN       5      1237.0
           Peugeot  768.0       5      1237.0
      2018 Opel     237.0       6       900.0
           Peugeot  663.0       4       900.0
Tesla 2016 Tesla      NaN       3         0.0
      2017 Tesla    695.0       6       695.0
      2018 Tesla    371.0       5       371.0
Answer 0 (score: 2)
You can do this with groupby.
df = df.sort_index()
print(df)
                     data  weight
owner year brand
PSA   2016 Citroen    NaN       4
           Opel       NaN       7
           Peugeot  880.0       1
      2017 Citroen  164.0       2
           Opel       NaN       5
           Peugeot  607.0       8
      2018 Opel     809.0       1
           Peugeot  317.0       8
Tesla 2016 Tesla      NaN       1
      2017 Tesla    384.0       9
      2018 Tesla    550.0       9
Group by owner and year, then assign the result to a new column.
df['new'] = df.groupby(['owner', 'year'])['data'].sum()
print(df)
                     data  weight     new
owner year brand
PSA   2016 Citroen    NaN       4   880.0
           Opel       NaN       7   880.0
           Peugeot  880.0       1   880.0
      2017 Citroen  164.0       2   771.0
           Opel       NaN       5   771.0
           Peugeot  607.0       8   771.0
      2018 Opel     809.0       1  1126.0
           Peugeot  317.0       8  1126.0
Tesla 2016 Tesla      NaN       1     0.0
      2017 Tesla    384.0       9   384.0
      2018 Tesla    550.0       9   550.0
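The assignment above relies on pandas aligning the two-level result with the three-level index. A possible alternative, not part of the original answer, is transform, which returns a result with the same index as the input, so no alignment is involved. The sketch below uses a small made-up frame with the same index layout as the question; the values and the column name group_data are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Small frame with the same index layout as in the question (values are made up).
index = pd.MultiIndex.from_arrays(
    [
        ["PSA", "PSA", "PSA", "Tesla"],
        [2016, 2016, 2017, 2016],
        ["Citroen", "Peugeot", "Peugeot", "Tesla"],
    ],
    names=["owner", "year", "brand"],
)
df = pd.DataFrame({"data": [np.nan, 250.0, 768.0, np.nan]}, index=index)

# transform("sum") computes the sum per (owner, year) group and broadcasts it
# back to every row of that group, so all three index levels are preserved.
df["group_data"] = df.groupby(level=["owner", "year"])["data"].transform("sum")
print(df)
```

As in the outputs above, NaN values are skipped by the sum, so a group that contains only NaN gets a total of 0.0.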
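Edit
This raised a follow-up question: why does assigning df['new'] return NaN when the grouping keys are columns, but the correct values when they are in the index? I asked that question on SO, and there is a great answer by @Jezrael here.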
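Answer 1 (score: 1)
I'm sure there are cases where a MultiIndex is useful, but I usually just want to get rid of it as quickly as possible, so I would start with df = df.reset_index(). Then you can easily group by brand, or by owner and year.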
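The grouping calls from this answer did not survive in the text, so the following is only a sketch of what the flat-frame approach might look like, not the author's original code. It assumes a small made-up frame with the question's index layout; the names flat, per_brand, per_owner_year and group_data are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up data with the same index layout as in the question.
index = pd.MultiIndex.from_arrays(
    [
        ["PSA", "PSA", "PSA", "Tesla"],
        [2016, 2016, 2017, 2016],
        ["Citroen", "Peugeot", "Peugeot", "Tesla"],
    ],
    names=["owner", "year", "brand"],
)
df = pd.DataFrame(
    {"data": [np.nan, 250.0, 768.0, np.nan], "weight": [5, 2, 5, 3]},
    index=index,
)

# Turn the three index levels into ordinary columns.
flat = df.reset_index()

# Totals per brand: one row per brand.
per_brand = flat.groupby("brand")[["data", "weight"]].sum()

# Totals per (owner, year): one row per group.
per_owner_year = flat.groupby(["owner", "year"])[["data", "weight"]].sum()

# To keep one row per brand while showing the (owner, year) total,
# broadcast the group sum back onto the original rows with transform.
flat["group_data"] = flat.groupby(["owner", "year"])["data"].transform("sum")

print(per_brand)
print(per_owner_year)
print(flat)
```

The two aggregations collapse the frame to one row per group, which is why the brand column disappears from them; the transform call is the variant that keeps every original row.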