Question

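I am working with a MultiIndex DataFrame and want to perform some groupby / apply() operations. I am struggling with how to combine groupby and apply.

I would like to extract the values of two index levels of my DataFrame and compare them inside the apply function.

For the rows where the apply function returns true, I want to run a groupby / sum over the values of my DataFrame.

Is there a good way to do this without using a for loop?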

# Index specifier
ix = pd.MultiIndex.from_product(
    [['2015', '2016', '2017', '2018'],
     ['2016', '2017', '2018', '2019', '2020'],
     ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group']
)

df = pd.DataFrame(np.random.randn(60,1), index= ix, columns=['Input'])

# Calculate sum over all projection periods for each simulation/group
all_periods = df.groupby(level=['SimulationStart', 'Group']).sum()

# This part of the code is not working yet
# is there a way to extract data from the indices of the DataFrame?
# Calculate sum over all projection periods for each simulation/group;
# where projection period is a maximum of one year in the future
one_year_ahead = df.groupby(level=['SimulationStart', 'Group']) \
                   .apply(lambda x: x['ProjectionPeriod'] - \
                                    x['SimulationStart'] <= 1).sum()

Answer 1

You can compute the difference ProjectionPeriod - SimulationStart before performing the groupby/sum operation:

get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()

import numpy as np
import pandas as pd

ix = pd.MultiIndex.from_product(
    [[2015, 2016, 2017, 2018],
     [2016, 2017, 2018, 2019, 2020],
     ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])

df = pd.DataFrame(np.random.randn(60,1), index= ix, columns=['Input'])

get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()
print(one_year_ahead)

This yields:

                          Input
SimulationStart Group
2015            A      0.821851
                B     -0.643342
                C     -0.140112
2016            A      0.384885
                B     -0.252186
                C     -1.057493
2017            A     -1.055933
                B      1.096221
                C     -4.150002
2018            A      0.584859
                B     -4.062078
                C      1.225105
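Note that the example above builds the index from integer years, whereas the question builds it from strings such as '2015', which cannot be subtracted directly. The following is a minimal sketch of one way to handle that case, assuming the year labels are always numeric strings. The smaller index and the deterministic np.arange values are made up here so the sums can be checked by hand; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Index built from string labels, as in the question (shortened for illustration)
ix = pd.MultiIndex.from_product(
    [['2015', '2016'], ['2016', '2017', '2018'], ['A', 'B']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'],
)
# Deterministic values 0..11 instead of random numbers (an assumption for clarity)
df = pd.DataFrame({'Input': np.arange(12, dtype=float)}, index=ix)

# Convert both year levels to integers before taking the difference
start = df.index.get_level_values('SimulationStart').astype(int)
period = df.index.get_level_values('ProjectionPeriod').astype(int)
mask = (period - start) <= 1

# Filter first, then group on the remaining index levels
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()
print(one_year_ahead)
```

The index of the DataFrame itself is left untouched, so the grouped result still carries the original string labels in SimulationStart.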

Answer 2

Here is one way to do it.

df.reset_index().query('ProjectionPeriod - SimulationStart == 1') \
  .groupby(['SimulationStart', 'Group']).Input.sum()

SimulationStart  Group
2015             A        1.100246
                 B       -0.605710
                 C        1.366465
2016             A        0.359406
                 B       -2.077444
                 C       -0.004356
2017             A        0.604497
                 B       -0.362941
                 C        0.103945
2018             A       -0.861976
                 B       -0.737274
                 C        0.237512
Name: Input, dtype: float64

This also works because the Group column holds unique values, but I do not fully trust it.
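As a rough comparison of the two answers, the sketch below runs the reset_index / query approach and the index-level mask approach on the same small frame. It uses <= 1 in both, matching the "at most one year ahead" wording of the question rather than the == 1 shown in this answer. The shortened index and the np.arange values are assumptions chosen so the results are reproducible.

```python
import numpy as np
import pandas as pd

# Small integer-year index (shortened for illustration)
ix = pd.MultiIndex.from_product(
    [[2015, 2016], [2016, 2017, 2018], ['A', 'B']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'],
)
df = pd.DataFrame({'Input': np.arange(12, dtype=float)}, index=ix)

# Approach 1: move the index levels into columns, filter with query, then group
via_query = (
    df.reset_index()
      .query('ProjectionPeriod - SimulationStart <= 1')
      .groupby(['SimulationStart', 'Group'])['Input']
      .sum()
)

# Approach 2: build a boolean mask straight from the index levels, then group
levels = df.index.get_level_values
mask = (levels('ProjectionPeriod') - levels('SimulationStart')) <= 1
via_mask = df.loc[mask].groupby(level=['SimulationStart', 'Group'])['Input'].sum()

print(via_query)
print(via_mask)
```

Both variants filter the rows before grouping, so neither needs apply or an explicit loop. The mask version avoids copying the index into columns, which may matter for large frames.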

Combining groupby and apply on MultiIndex DataFrames
