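I am working with a MultiIndex DataFrame and want to perform some groupby / apply() operations. I am struggling with how to combine groupby and apply.
I would like to extract the values of two index levels of my DataFrame and compare them inside the apply function.
For the rows where the apply function returns True, I would like to do a groupby / sum over the values of my DataFrame.
Is there a good way to do this without using for loops?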
import numpy as np
import pandas as pd

# Index specifier
ix = pd.MultiIndex.from_product(
[['2015', '2016', '2017', '2018'],
['2016', '2017', '2018', '2019', '2020'],
['A', 'B', 'C']],
names=['SimulationStart', 'ProjectionPeriod', 'Group']
)
df = pd.DataFrame(np.random.randn(60,1), index= ix, columns=['Input'])
# Calculate sum over all projection periods for each simulation/group
all_periods = df.groupby(level=['SimulationStart', 'Group']).sum()
# This part of the code is not working yet
# is there a way to extract data from the indices of the DataFrame?
# Calculate sum over all projection periods for each simulation/group;
# where projection period is a maximum of one year in the future
one_year_ahead = df.groupby(level=['SimulationStart', 'Group']) \
.apply(lambda x: x['ProjectionPeriod'] - \
x['SimulationStart'] <= 1).sum()
Answer 0 (score: 4)
You can compute the difference ProjectionPeriod - SimulationStart before performing the groupby/sum operation, and use it as a boolean mask to select rows:
get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()
import numpy as np
import pandas as pd
ix = pd.MultiIndex.from_product(
[[2015, 2016, 2017, 2018],
[2016, 2017, 2018, 2019, 2020], ['A', 'B', 'C']],
names=['SimulationStart', 'ProjectionPeriod', 'Group'])
df = pd.DataFrame(np.random.randn(60,1), index= ix, columns=['Input'])
get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()
print(one_year_ahead)
yields
                         Input
SimulationStart Group
2015            A     0.821851
                B    -0.643342
                C    -0.140112
2016            A     0.384885
                B    -0.252186
                C    -1.057493
2017            A    -1.055933
                B     1.096221
                C    -4.150002
2018            A     0.584859
                B    -4.062078
                C     1.225105
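Note that the example in this answer uses integer index levels, whereas the question builds them from strings such as '2015'. Subtracting string labels does not give a year difference, so the levels need to be converted first. The following is a minimal sketch of one way to do that, assuming every label is a digit string that `astype(int)` can parse. The deterministic `np.arange` data is only there so the result is easy to check by hand; it is not part of the original post.

```python
import numpy as np
import pandas as pd

# Same index layout as in the question, with string labels.
ix = pd.MultiIndex.from_product(
    [['2015', '2016', '2017', '2018'],
     ['2016', '2017', '2018', '2019', '2020'],
     ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])

# Deterministic values (0.0 .. 59.0) instead of random numbers.
df = pd.DataFrame({'Input': np.arange(60.0)}, index=ix)

# Pull each level out of the index and convert the labels to integers
# (assumption: all labels are digit strings).
levels = df.index.get_level_values
start = levels('SimulationStart').astype(int)
period = levels('ProjectionPeriod').astype(int)

# Boolean mask: keep rows at most one year after the simulation start.
mask = (period - start) <= 1

# Filter first, then group and sum; the index itself keeps its string labels.
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()
print(one_year_ahead)
```

Because only the mask is built from the converted values, the grouped result is still keyed by the original string labels, for example `one_year_ahead.loc[('2016', 'A'), 'Input']`.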
Answer 1 (score: 3)
Here is one way to do it.
df.reset_index().query('ProjectionPeriod - SimulationStart == 1') \
    .groupby(['SimulationStart', 'Group']).Input.sum()
SimulationStart  Group
2015             A        1.100246
                 B       -0.605710
                 C        1.366465
2016             A        0.359406
                 B       -2.077444
                 C       -0.004356
2017             A        0.604497
                 B       -0.362941
                 C        0.103945
2018             A       -0.861976
                 B       -0.737274
                 C        0.237512
Name: Input, dtype: float64
Since you have unique values in the Group column, this also works, though I would not rely on it.
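The query in this answer filters on a difference of exactly 1, while the question asks for projection periods at most one year ahead. The sketch below runs both conditions side by side to show how they differ. It assumes integer index levels, as in Answer 0, and uses deterministic `np.arange` data (an assumption made here for checkability, not taken from the original post).

```python
import numpy as np
import pandas as pd

# Integer-valued index levels so arithmetic inside query() is meaningful.
ix = pd.MultiIndex.from_product(
    [[2015, 2016, 2017, 2018],
     [2016, 2017, 2018, 2019, 2020],
     ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])
df = pd.DataFrame({'Input': np.arange(60.0)}, index=ix)

# Move the index levels into ordinary columns so query() can refer to them.
flat = df.reset_index()

# Rows exactly one year ahead (the condition used in this answer).
exactly_one = (flat.query('ProjectionPeriod - SimulationStart == 1')
                   .groupby(['SimulationStart', 'Group'])['Input'].sum())

# Rows at most one year ahead (the condition stated in the question).
at_most_one = (flat.query('ProjectionPeriod - SimulationStart <= 1')
                   .groupby(['SimulationStart', 'Group'])['Input'].sum())

print(exactly_one)
print(at_most_one)
```

With this index the two results agree only for SimulationStart 2015, because later simulation starts also have projection periods in the same year or earlier, which `<= 1` keeps and `== 1` drops.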