Standard deviation with pandas groupby (multiple columns)

Time: 2020-08-08 23:02:57

Tags: pandas pandas-groupby stdev

I am working with data from the California Air Resources Board.

site,monitor,date,start_hour,value,variable,units,quality,prelim,name 
5407,t,2014-01-01,0,3.00,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,1,1.54,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,2,3.76,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,3,5.98,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,4,8.09,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,5,12.05,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,6,12.55,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
...

import pandas as pd
from pandas.tseries.offsets import MonthEnd

df = pd.concat([pd.read_csv(file, header = 0) for file in f]) # merges all files (paths listed in f) into one dataframe
df.dropna(axis = 0, how = "all", subset = ['start_hour', 'variable'],
          inplace = True) # drops the trailing rows that have no data in them (NaN)

df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df.set_index('datetime', inplace = True)
df =  df.rename(columns={'value':'conc'})

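I have several years of hourly PM2.5 concentration data and am trying to make plots showing the average monthly concentration across the years (a separate plot for each month). Here is an image of the plot I have created so far. [![Bombay Beach][1]][1] I would like to add error bars to the average concentration line, but I am running into problems when trying to calculate the standard deviation. I created a new dataframe d_avg that contains the year, month, day, and average PM2.5 concentration. Here is some of the data.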

d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
   year  month  day      conc
0  2014      1    1  9.644583
1  2014      1    2  4.945652
2  2014      1    3  4.345238
3  2014      1    4  5.047917
4  2014      1    5  5.212857
5  2014      1    6  2.095714

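After that, I computed the monthly averages m_avg and created a datetime column so that I could plot datetime against the monthly average concentration (see the black line in the image above).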

m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
m_avg['datetime'] = pd.to_datetime(m_avg.year.astype(str) + m_avg.month.astype(str), format='%Y%m') + MonthEnd(1)
[In]: m_avg.head(6)
[Out]:
   year  month      conc   datetime
0  2014      1  4.330985 2014-01-31
1  2014      2  2.280096 2014-02-28
2  2014      3  4.464622 2014-03-31
3  2014      4  6.583759 2014-04-30
4  2014      5  9.069353 2014-05-31
5  2014      6  9.982330 2014-06-30
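For reference, here is a minimal sketch of another way to build the month-end timestamps without concatenating strings. It assumes a frame with integer `year` and `month` columns like m_avg above; the sample values are made up for illustration only.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical stand-in for m_avg (values are illustrative only)
m_avg = pd.DataFrame({'year': [2014, 2014, 2014],
                      'month': [1, 2, 11],
                      'conc': [4.3, 2.3, 6.1]})

# to_datetime can assemble timestamps from year/month/day columns;
# MonthEnd(0) rolls each first-of-month date forward to that month's last day
first_of_month = pd.to_datetime(m_avg[['year', 'month']].assign(day=1))
m_avg['datetime'] = first_of_month + MonthEnd(0)
print(m_avg)
```

Because the month is passed as a number rather than as text, single-digit and two-digit months are handled the same way.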

Now I want to calculate the standard deviation of the d_avg concentrations, and I have tried several approaches:

sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()

sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(np.std)

sd = d_avg['conc'].apply(lambda x: x.std())

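However, every attempt leaves me with the same problem in the dataframe. I cannot plot the standard deviation because I believe it is also taking the standard deviation of the year and month, which are the columns I am trying to group the data by. This is what my resulting dataframe sd looks like: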

        year     month        sd
0  44.877611  1.000000  1.795868
1  44.877611  1.414214  2.355055
2  44.877611  1.732051  2.597531
3  44.877611  2.000000  2.538749
4  44.877611  2.236068  5.456785
5  44.877611  2.449490  3.315546
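A quick check on the table above: the year and month values look like square roots of the grouping keys (2014 and months 1 to 6) rather than standard deviations of anything. The sketch below only verifies that arithmetic. It does not establish why pandas produced these values, although they are consistent with the square-root step of the standard deviation having been applied to the key columns as well.

```python
import numpy as np

# Grouping keys behind the six rows shown: year 2014, months 1 through 6
year_root = np.sqrt(2014)
month_roots = np.sqrt(np.arange(1, 7))

print(round(year_root, 6))    # compare with the 'year' column above
print(month_roots.round(6))   # compare with the 'month' column above
```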

Please help!

  [1]: https://i.stack.imgur.com/ueVrG.png

2 Answers:

Answer 0 (score: 0)

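I tried to reproduce your error, and it works fine for me. Here is my complete code sample, which is almost exactly the same as yours EXCEPT for the part that generates the original dataframe. So I suspect that part of the code. Can you provide the code that creates the dataframe?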

import pandas as pd

columns = ['year', 'month', 'day', 'conc']
data = [[2014, 1, 1, 2.0],
        [2014, 1, 1, 4.0],
        [2014, 1, 2, 6.0],
        [2014, 1, 2, 8.0],
        [2014, 2, 1, 2.0],
        [2014, 2, 1, 6.0],
        [2014, 2, 2, 10.0],
        [2014, 2, 2, 14.0]]

df = pd.DataFrame(data, columns=columns)
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year', 'month'], as_index=False)['conc'].mean()
m_std = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()

print(f'Concentrations:\n{df}\n')
print(f'Daily Average:\n{d_avg}\n')
print(f'Monthly Average:\n{m_avg}\n')
print(f'Monthly Standard Deviation:\n{m_std}\n')

Output:

Concentrations:
   year  month  day  conc
0  2014      1    1   2.0
1  2014      1    1   4.0
2  2014      1    2   6.0
3  2014      1    2   8.0
4  2014      2    1   2.0
5  2014      2    1   6.0
6  2014      2    2  10.0
7  2014      2    2  14.0

Daily Average:
   year  month  day  conc
0  2014      1    1   3.0
1  2014      1    2   7.0
2  2014      2    1   4.0
3  2014      2    2  12.0

Monthly Average:
   year  month  conc
0  2014      1   5.0
1  2014      2   8.0

Monthly Standard Deviation:
   year  month      conc
0  2014      1  2.828427
1  2014      2  5.656854
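Note that the values printed above are sample standard deviations, since pandas `std()` defaults to `ddof=1`. As a small sketch, using only the two January daily averages from the example, this is how the result changes with `ddof=0`, in case the population form is what you want for the error bars:

```python
import pandas as pd

# January daily averages from the example above
daily = pd.Series([3.0, 7.0])

sample_sd = daily.std()            # pandas default, ddof=1 (divide by N - 1)
population_sd = daily.std(ddof=0)  # divide by N instead

print(sample_sd, population_sd)
```

With only a few days per group the two values can differ noticeably, so it is worth stating which one the error bars represent.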

Answer 1 (score: 0)

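I decided to dance around my problem since I could not figure out what was causing it. I combined the m_avg and sd dataframes and dropped the year and month columns that were causing the problem. See my code below, which involves a lot of renaming.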

d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std(ddof=0) 
sd = sd.rename(columns={"conc":"sd", "year":"wrongyr", "month":"wrongmth"})
m_avg_sd = pd.concat([m_avg, sd], axis = 1)
m_avg_sd.drop(columns=['wrongyr', 'wrongmth'], inplace = True)
m_avg_sd['datetime'] = pd.to_datetime(m_avg_sd.year.astype(str) + m_avg_sd.month.astype(str), format='%Y%m') + MonthEnd(1)

Here is the new dataframe:

m_avg_sd.head(5)
Out[2]: 
   year  month       conc         sd   datetime
0  2009      1  48.350105  18.394192 2009-01-31
1  2009      2  21.929383  16.293645 2009-02-28
2  2009      3  15.094729   6.821124 2009-03-31
3  2009      4  12.021009   4.391219 2009-04-30
4  2009      5  13.449100   4.081734 2009-05-31
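A possibly simpler route to the same table, sketched under the assumption that the stray year and month values come from grouping with `as_index=False`: group with the default index, compute both statistics in one `agg` call, and then call `reset_index()`. The sample data below is made up, and `ddof=0` is used only to match the code above.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Made-up daily averages standing in for d_avg
d_avg = pd.DataFrame({'year':  [2014, 2014, 2014, 2014],
                      'month': [1, 1, 2, 2],
                      'day':   [1, 2, 1, 2],
                      'conc':  [3.0, 7.0, 4.0, 12.0]})

# The keys stay in the index during aggregation, so only 'conc' is aggregated;
# named aggregation produces the 'conc' and 'sd' columns directly
m_avg_sd = (d_avg.groupby(['year', 'month'])['conc']
                 .agg(conc='mean', sd=lambda s: s.std(ddof=0))
                 .reset_index())

# Month-end timestamp for plotting
m_avg_sd['datetime'] = pd.to_datetime(
    m_avg_sd[['year', 'month']].assign(day=1)) + MonthEnd(0)
print(m_avg_sd)
```

This avoids the concat, rename, and drop steps, and the mean and standard deviation cannot end up misaligned because they come from the same groupby.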