I am working with data from the California Air Resources Board.
site,monitor,date,start_hour,value,variable,units,quality,prelim,name
5407,t,2014-01-01,0,3.00,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,1,1.54,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,2,3.76,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,3,5.98,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,4,8.09,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,5,12.05,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,6,12.55,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
...
import pandas as pd
from pandas.tseries.offsets import MonthEnd

df = pd.concat([pd.read_csv(file, header=0) for file in f])  # merges all files into one dataframe
df.dropna(axis=0, how="all", subset=['start_hour', 'variable'],
          inplace=True)  # drops trailing rows that contain no data (NaN)
df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df.set_index('datetime', inplace = True)
df = df.rename(columns={'value':'conc'})
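I have several years of hourly PM2.5 concentration data and am trying to make plots showing the average monthly concentration over the years (a separate plot for each month). Here is an image of the plot I have created so far. [![Bombay Beach][1]][1] I want to add error bars to the average concentration line, but I am running into problems when trying to calculate the standard deviation. I created a new dataframe, d_avg, that contains the year, month, day, and average PM2.5 concentration. Here is some of the data.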
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
year month day conc
0 2014 1 1 9.644583
1 2014 1 2 4.945652
2 2014 1 3 4.345238
3 2014 1 4 5.047917
4 2014 1 5 5.212857
5 2014 1 6 2.095714
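After that, I computed the monthly average, m_avg, and created a datetime column so I could plot datetime against monthly average concentration (see the black line above).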
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
m_avg['datetime'] = pd.to_datetime(m_avg.year.astype(str) + m_avg.month.astype(str), format='%Y%m') + MonthEnd(1)
[In]: m_avg.head(6)
[Out]:
year month conc datetime
0 2014 1 4.330985 2014-01-31
1 2014 2 2.280096 2014-02-28
2 2014 3 4.464622 2014-03-31
3 2014 4 6.583759 2014-04-30
4 2014 5 9.069353 2014-05-31
5 2014 6 9.982330 2014-06-30
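As a side note on the datetime column: concatenating unpadded year and month strings relies on `%Y%m` reading `20141` as January 2014. A minimal sketch of a construction that avoids string parsing, assuming the same integer `year` and `month` columns (the small `m_avg` frame below is made-up illustration data, not the real measurements):

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical stand-in for m_avg; the values are illustrative only.
m_avg = pd.DataFrame({
    "year": [2014, 2014, 2015],
    "month": [1, 11, 2],
    "conc": [4.3, 2.2, 4.4],
})

# pd.to_datetime can assemble dates from year/month/day columns.
first_of_month = pd.to_datetime(m_avg[["year", "month"]].assign(day=1))

# From the first of a month, MonthEnd(1) rolls forward to that month's last day.
m_avg["datetime"] = first_of_month + MonthEnd(1)
print(m_avg)
```

Because the dates are built from numbers rather than strings, single-digit and double-digit months are handled the same way.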
Now I want to calculate the standard deviation of the d_avg concentrations, and I have tried several approaches:
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(np.std)
sd = d_avg['conc'].apply(lambda x: x.std())
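However, every attempt gives me the same error in the dataframe. I can't plot the standard deviation because I think it is also taking the standard deviation of the year and month columns, which are the columns I am trying to group the data by. This is what my resulting dataframe sd looks like: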
year month sd
0 44.877611 1.000000 1.795868
1 44.877611 1.414214 2.355055
2 44.877611 1.732051 2.597531
3 44.877611 2.000000 2.538749
4 44.877611 2.236068 5.456785
5 44.877611 2.449490 3.315546
Please help!

[1]: https://i.stack.imgur.com/ueVrG.png
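Answer 0 (score: 0)
I tried to reproduce your error, and it works fine for me. Here is my complete code example, which is almost identical to yours EXCEPT for the part that generates the original dataframe, so I suspect that part of the code. Can you provide the code that creates the dataframe?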
import pandas as pd
columns = ['year', 'month', 'day', 'conc']
data = [[2014, 1, 1, 2.0],
[2014, 1, 1, 4.0],
[2014, 1, 2, 6.0],
[2014, 1, 2, 8.0],
[2014, 2, 1, 2.0],
[2014, 2, 1, 6.0],
[2014, 2, 2, 10.0],
[2014, 2, 2, 14.0]]
df = pd.DataFrame(data, columns=columns)
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year', 'month'], as_index=False)['conc'].mean()
m_std = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
print(f'Concentrations:\n{df}\n')
print(f'Daily Average:\n{d_avg}\n')
print(f'Monthly Average:\n{m_avg}\n')
print(f'Monthly Standard Deviation:\n{m_std}\n')
Output:
Concentrations:
year month day conc
0 2014 1 1 2.0
1 2014 1 1 4.0
2 2014 1 2 6.0
3 2014 1 2 8.0
4 2014 2 1 2.0
5 2014 2 1 6.0
6 2014 2 2 10.0
7 2014 2 2 14.0
Daily Average:
year month day conc
0 2014 1 1 3.0
1 2014 1 2 7.0
2 2014 2 1 4.0
3 2014 2 2 12.0
Monthly Average:
year month conc
0 2014 1 5.0
1 2014 2 8.0
Monthly Standard Deviation:
year month conc
0 2014 1 2.828427
1 2014 2 5.656854
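The standard deviations above can be checked by hand. pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so for January the daily means 3.0 and 7.0 give sqrt(((3-5)^2 + (7-5)^2) / (2-1)) = sqrt(8) ≈ 2.828427. A short sketch of that check, reusing the toy daily means from this answer and showing the population form (`ddof=0`) for comparison:

```python
import math
import pandas as pd

# Daily means from the toy example above.
jan = pd.Series([3.0, 7.0])
feb = pd.Series([4.0, 12.0])

# Sample standard deviation (pandas default, ddof=1).
sample_jan = jan.std()
sample_feb = feb.std()

# Population standard deviation (ddof=0) divides by n instead of n - 1.
pop_jan = jan.std(ddof=0)
pop_feb = feb.std(ddof=0)

# The same January value computed directly from the definition.
manual_jan = math.sqrt(((3.0 - 5.0) ** 2 + (7.0 - 5.0) ** 2) / (2 - 1))

print(sample_jan, sample_feb)  # about 2.828427 and 5.656854
print(pop_jan, pop_feb)        # about 2.0 and 4.0
print(manual_jan)
```

The choice of `ddof` matters when comparing results across code versions, since `ddof=0` always yields the smaller value for the same data.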
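Answer 1 (score: 0)
I decided to dance around my problem, since I don't know what was causing it. I merged the m_avg and sd dataframes and dropped the year and month columns that were causing the problem. See my code below, which involves a lot of renaming.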
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std(ddof=0)
sd = sd.rename(columns={"conc":"sd", "year":"wrongyr", "month":"wrongmth"})
m_avg_sd = pd.concat([m_avg, sd], axis = 1)
m_avg_sd.drop(columns=['wrongyr', 'wrongmth'], inplace = True)
m_avg_sd['datetime'] = pd.to_datetime(m_avg_sd.year.astype(str) + m_avg_sd.month.astype(str), format='%Y%m') + MonthEnd(1)
Here is the new dataframe:
m_avg_sd.head(5)
Out[2]:
year month conc sd datetime
0 2009 1 48.350105 18.394192 2009-01-31
1 2009 2 21.929383 16.293645 2009-02-28
2 2009 3 15.094729 6.821124 2009-03-31
3 2009 4 12.021009 4.391219 2009-04-30
4 2009 5 13.449100 4.081734 2009-05-31
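A possible alternative to the rename-and-concat workaround, offered as a sketch rather than a confirmed fix: the `year` and `month` values in the broken `sd` frame (44.877611, 1.414214, 1.732051, ...) match the square roots of 2014, 2, 3, ..., which suggests the group keys themselves were being transformed when `as_index=False` was used. Keeping the keys in the index and computing both statistics in a single `agg` call avoids touching them. The data below is made up for illustration, and `ddof=0` is kept only to match the code above.

```python
import math
import pandas as pd

# Toy daily averages (hypothetical values, not the real measurements).
d_avg = pd.DataFrame({
    "year": [2014, 2014, 2014, 2014],
    "month": [1, 1, 2, 2],
    "day": [1, 2, 1, 2],
    "conc": [3.0, 7.0, 4.0, 12.0],
})

# Group keys stay in the index during aggregation, so only 'conc' is reduced;
# reset_index() then restores year and month as ordinary columns.
m_avg_sd = (
    d_avg.groupby(["year", "month"])["conc"]
    .agg(conc="mean", sd=lambda s: s.std(ddof=0))
    .reset_index()
)
print(m_avg_sd)

# The suspicious 'year' value in the broken frame is the square root of 2014.
print(math.sqrt(2014))
```

The resulting `sd` column could then be passed as `yerr` to matplotlib's `errorbar` to draw the error bars on the monthly average line.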