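Combining groupby and apply on MultiIndex DataFrames

Time: 2016-11-03 19:27:19

Tags: python pandas dataframe group-by

I am working with a MultiIndex DataFrame and want to perform some groupby / apply() operations. I am struggling with how to combine groupby and apply.

I would like to extract the values of two index levels of my DataFrame and compare them inside the apply function.

For the rows where the apply function returns true, I would like to do a groupby / sum over the values of my DataFrame.

Is there a good way to do this without using for loops?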

import numpy as np
import pandas as pd

# Index specifier
ix = pd.MultiIndex.from_product(
    [['2015', '2016', '2017', '2018'],
     ['2016', '2017', '2018', '2019', '2020'],
     ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group']
)

df = pd.DataFrame(np.random.randn(60,1), index= ix, columns=['Input'])

# Calculate sum over all projection periods for each simulation/group
all_periods = df.groupby(level=['SimulationStart', 'Group']).sum()

# This part of the code is not working yet
# is there a way to extract data from the indices of the DataFrame?
# Calculate sum over all projection periods for each simulation/group;
# where projection period is a maximum of one year in the future
one_year_ahead = df.groupby(level=['SimulationStart', 'Group']) \
                   .apply(lambda x: x['ProjectionPeriod'] - \
                                    x['SimulationStart'] <= 1).sum()

2 Answers:

Answer 0 (score: 4)

You can compute the difference ProjectionPeriod - SimulationStart before performing the groupby/sum operation:

get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()

For example, with integer-valued index levels so that the subtraction is well defined:

import numpy as np
import pandas as pd
ix = pd.MultiIndex.from_product(
    [[2015, 2016, 2017, 2018], 
     [2016, 2017, 2018, 2019, 2020], ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])
df = pd.DataFrame(np.random.randn(60,1), index= ix, columns=['Input'])

get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()
print(one_year_ahead)

yields

                          Input
SimulationStart Group          
2015            A      0.821851
                B     -0.643342
                C     -0.140112
2016            A      0.384885
                B     -0.252186
                C     -1.057493
2017            A     -1.055933
                B      1.096221
                C     -4.150002
2018            A      0.584859
                B     -4.062078
                C      1.225105
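
Because the output above comes from random data, it cannot be checked by hand. The following is a minimal sketch of the same mask technique on a smaller, deterministic frame; the reduced index and the values 0 to 11 are assumptions made purely for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Smaller, deterministic stand-in for the random frame (illustrative only)
ix = pd.MultiIndex.from_product(
    [[2015, 2016], [2016, 2017, 2018], ['A', 'B']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])
df = pd.DataFrame({'Input': np.arange(12.0)}, index=ix)

# Pull each index level out as a flat Index, one entry per row
start = df.index.get_level_values('SimulationStart')
period = df.index.get_level_values('ProjectionPeriod')

# Boolean array: True where the projection is at most one year ahead
mask = (period - start) <= 1

# Filter first, then group on the remaining index levels
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()
print(one_year_ahead)
```

The key point is that the comparison happens on the index levels before grouping, so no apply() is needed.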

Answer 1 (score: 3)

Here is one way to do it.

df.reset_index().query('ProjectionPeriod - SimulationStart == 1') \
  .groupby(['SimulationStart', 'Group']).Input.sum()

SimulationStart  Group
2015             A        1.100246
                 B       -0.605710
                 C        1.366465
2016             A        0.359406
                 B       -2.077444
                 C       -0.004356
2017             A        0.604497
                 B       -0.362941
                 C        0.103945
2018             A       -0.861976
                 B       -0.737274
                 C        0.237512
Name: Input, dtype: float64

Because the Group column contains unique values, this also works, but I don't trust it.
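
As a sanity check on the reset_index().query(...) approach, here is a hedged sketch comparing it with the index-level mask from the other answer, using the same `== 1` condition in both. The small deterministic frame is an assumption for illustration, not data from the thread. Note that this answer filters on `== 1` while the other answer uses `<= 1`, so the two only agree when the condition is made identical.

```python
import numpy as np
import pandas as pd

# Small deterministic frame (illustrative assumption, not the thread's data)
ix = pd.MultiIndex.from_product(
    [[2015, 2016], [2016, 2017, 2018], ['A', 'B']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])
df = pd.DataFrame({'Input': np.arange(12.0)}, index=ix)

# Approach 1: turn index levels into columns, filter with query, then group
via_query = (df.reset_index()
               .query('ProjectionPeriod - SimulationStart == 1')
               .groupby(['SimulationStart', 'Group'])['Input'].sum())

# Approach 2: build a boolean mask directly from the index levels
get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) == 1
via_mask = (df.loc[mask]
              .groupby(level=['SimulationStart', 'Group'])['Input'].sum())

print(via_query)
print(via_mask)
```

The mask version avoids rebuilding the frame with reset_index(), while the query version keeps the condition readable as a single string; which one is preferable depends on taste and frame size.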
