How do I get the mean of only the positive values in a pandas groupby?
MWE:
import numpy as np
import pandas as pd
flights = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/nycflights13.csv?raw=true')
print(flights.shape)
print(flights.iloc[:2,:4])
print()
not_cancelled = flights.dropna(subset=['dep_delay','arr_delay'])
df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
.mean().reset_index()
)
df['avg_delay2'] = df[df.arr_delay>0]['arr_delay'].mean()
print(df.head())
This sets every avg_delay2 value to 16.66:
(336776,19)
year month day dep_time
0 2013 1 1 517.0
1 2013 1 1 533.0
year month day arr_delay avg_delay2
0 2013 1 1 12.651023 16.665681
1 2013 1 2 12.692888 16.665681
2 2013 1 3 5.733333 16.665681
3 2013 1 4 -1.932819 16.665681
4 2013 1 5 -1.525802 16.665681
That is wrong.
# sanity check
a = not_cancelled.query(""" year==2013 & month ==1 & day ==1 """)['arr_delay']
a = a[a>0]
a.mean() # 32.48156182212581
When I do the same thing in R:
library(nycflights13)
library(dplyr)  # provides %>%, filter, group_by, summarize
not_cancelled = flights %>%
filter( !is.na(dep_delay), !is.na(arr_delay))
df = not_cancelled %>%
group_by(year,month,day) %>%
summarize(
# average delay
avg_delay1 = mean(arr_delay),
# average positive delay
avg_delay2 = mean(arr_delay[arr_delay>0]))
head(df)
it gives the correct output for avg_delay2:
year month day avg_delay1 avg_delay2
2013 1 1 12.651023 32.48156
2013 1 2 12.692888 32.02991
2013 1 3 5.733333 27.66087
2013 1 4 -1.932819 28.30976
2013 1 5 -1.525802 22.55882
2013 1 6 4.236429 24.37270
How can I do this in pandas?
Answer 0 (score: 3)
I would filter for the positive values before the groupby:
df = (not_cancelled[not_cancelled.arr_delay >0].groupby(['year','month','day'])['arr_delay']
.mean().reset_index()
)
df.head()
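One caveat with filtering first: a day that has no positive delays drops out of the result entirely, and the overall mean is no longer next to it. Below is a minimal sketch on a made-up toy frame (the values and the names toy, avg_all, pos_mean and out are illustrative assumptions, not taken from the flights data) that aligns the filtered means back onto every group:

```python
import pandas as pd

# Hypothetical toy data: day 2 has no positive delays
toy = pd.DataFrame({
    'day':       [1, 1, 1, 2, 2],
    'arr_delay': [10.0, -4.0, 30.0, -5.0, -1.0],
})

# Overall mean per day (every day is present)
avg_all = toy.groupby('day')['arr_delay'].mean()

# Mean of positive delays per day (day 2 is missing from this result)
pos_mean = toy[toy.arr_delay > 0].groupby('day')['arr_delay'].mean()

# Align on the full set of days; days without positive delays become NaN
out = pd.DataFrame({
    'avg_delay1': avg_all,
    'avg_delay2': pos_mean.reindex(avg_all.index),
}).reset_index()
print(out)
```

Whether a NaN or a missing row is preferable for such days depends on what the downstream analysis expects.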
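This is because, in your code, df is a separate DataFrame that exists only after the groupby operation has finished, so
df['avg_delay2'] = df[df.arr_delay>0]['arr_delay'].mean()
computes a single number (the mean of the positive daily means) and assigns that same value to every row of df['avg_delay2'].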
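The broadcasting behaviour can be seen on a tiny example. This is only a sketch with invented numbers; the frame daily stands in for the grouped df from the question:

```python
import pandas as pd

# Hypothetical per-day means, standing in for the grouped df in the question
daily = pd.DataFrame({'day': [1, 2, 3], 'arr_delay': [12.0, -2.0, 6.0]})

# The right-hand side is a single float: the mean of the positive daily means
scalar = daily[daily.arr_delay > 0]['arr_delay'].mean()
print(scalar)  # 9.0

# Assigning a scalar to a column broadcasts it to every row
daily['avg_delay2'] = scalar
print(daily)
```

By this point the per-flight delays are already gone, so the positive-only mean has to be computed on the ungrouped data.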
Edit: similar to R, you can use agg to do both operations in one pass:
def mean_pos(x):
    return x[x>0].mean()
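# Named aggregation (pandas >= 0.25). The older dict form
# .agg({'arr_delay': 'mean', 'arr_delay_2': mean_pos}) is deprecated on a
# Series groupby and raises an error in pandas >= 1.0.
df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
.agg(arr_delay='mean', arr_delay_2=mean_pos)
)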
df.head()
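Answer 1 (score: 0)
Note that as of pandas 0.23, passing a dictionary to agg on a Series groupby is deprecated and will be removed in a future version, so we should not rely on this approach: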
df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
.agg({'arr_delay': 'mean', 'arr_delay_2': mean_pos})
)
FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version.
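To work around this, I came up with another idea:
create a new column in which all non-positive values are replaced by NaN, then do the usual groupby. Since mean() skips NaN by default, only the positive values contribute.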
import numpy as np
import pandas as pd
# read data
flights = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/nycflights13.csv?raw=true')
# select flights that are not cancelled
df = flights.dropna(subset=['dep_delay','arr_delay']).copy()  # copy so the column assignment below does not warn
# create new column to fill non-positive with nans
df['arr_delay_pos'] = df['arr_delay']
df.loc[df.arr_delay_pos <= 0,'arr_delay_pos'] = np.nan
df.groupby(['year','month','day'])[['arr_delay','arr_delay_pos']].mean().reset_index().head()
This gives:
year month day arr_delay arr_delay_pos
0 2013 1 1 12.651023 32.481562
1 2013 1 2 12.692888 32.029907
2 2013 1 3 5.733333 27.660870
3 2013 1 4 -1.932819 28.309764
4 2013 1 5 -1.525802 22.558824
# sanity check
a = df.query("year == 2013 & month == 1 & day == 1")['arr_delay']
a = a[a>0]
a.mean() # 32.48156182212581
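The masking step can also be written in one line with Series.where, which keeps values where the condition holds and puts NaN elsewhere. Here is a minimal sketch on made-up toy data (the frame toy and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'day':       [1, 1, 1, 2, 2],
    'arr_delay': [10.0, -4.0, 30.0, -5.0, 7.0],
})

# where() keeps values that satisfy the condition and replaces the rest with NaN
toy['arr_delay_pos'] = toy['arr_delay'].where(toy['arr_delay'] > 0)

# mean() skips NaN by default, so only positive delays count in arr_delay_pos
res = toy.groupby('day')[['arr_delay', 'arr_delay_pos']].mean().reset_index()
print(res)
```

This avoids the separate copy-then-overwrite steps while producing the same kind of two-column result.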