How to make a graph plotting monthly data over many years in pandas

mda8

Question

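I have 11 years' worth of hourly ozone concentration data.
- There are 11 csv files containing ozone concentrations at every hour of every day.
I was able to read in all of the files and convert the index from date to datetime.
For my graph:
- I calculated the maximum daily 8-hour average (MDA8) and then averaged those values over each month.
My new dataframe (df3) has:
- A datetime index, which consists of the last day of each month for every year over the 12 years.
- A column containing the averaged MDA8 values.
I want to make 3 separate scatter plots for the months of April, May, and June. (x axis = year, y axis = average MDA8 for the month)
- However, I am stuck on how to select these individual months and plot the yearly data.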

Minimal sample

site,date,start_hour,value,variable,units,quality,prelim,name 
3135,2010-01-01,0,13.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,1,5.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,2,11.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,3,17.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,5,16.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street

Here is a link to find similar CSV data: https://www.arb.ca.gov/aqmis2/aqdselect.php?tab=hourly

I've attached some code below:

import pandas as pd
import os
import glob
import matplotlib.pyplot as plt

path = "C:/Users/blah"
for f in glob.glob(os.path.join(path, "*.csv")):
    df = pd.read_csv(f, header = 0, index_col='date')
    df2 = df.dropna(axis = 0, how = "all", subset = ['start_hour', 'variable'], inplace = True) 
    df = df.iloc[0:]
    df.index = pd.to_datetime(df.index) #converting date to datetime
    df['start_hour'] = pd.to_timedelta(df['start_hour'], unit = 'h')
    df['datetime'] = df.index + df['start_hour']
    df.set_index('datetime', inplace = True)

    df2 = df.value.rolling('8H', min_periods = 6).mean() 
    df2.index -= pd.DateOffset(hours=3)
    df2 = df2.resample('D').max()  # daily maximum of the 8-hour rolling mean
    df2.index.name = 'timestamp'

The problem occurs below:

    df3 = df2.groupby(pd.Grouper(freq = 'M')).mean()
    df4 = df3[df3.index.month.isin([4,5,6])]
    if df4 == True:
        plt.plot(df3.index, df3.values)
    print(df4)

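Every time I do this, I get the message "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." When I try this code with df4.any() == True:, it plots all of the months except April to June, and it plots all of the values in the same figure. I want a different plot for each month.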
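For context, the error appears because `df4 == True` compares every element and returns a boolean Series, which `if` cannot reduce to a single True or False. The sketch below uses made-up monthly values (the name `monthly` and the numbers are illustrative only, not taken from the data above) to show that the boolean mask from `isin` already performs the filtering, so no `if` statement is needed.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical monthly means for 2011, indexed by month end (values are made up)
idx = pd.date_range("2011-01-01", periods=12, freq="MS") + MonthEnd(1)
monthly = pd.Series(range(40, 52), index=idx, dtype=float)

# isin() returns one boolean per row; indexing with the mask keeps only the True rows
mask = monthly.index.month.isin([4, 5, 6])
spring = monthly[mask]
print(spring)

# Converting a multi-element comparison result to a single bool is what raises the error
try:
    bool(spring == True)
except ValueError as err:
    print("ValueError:", err)
```

The filtered Series can be passed straight to a plotting call.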

I also tried adding the following and removing the previous if statement:

    df5 = df4.index.year.isin([2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019])
    if df5.all() == True:
        plt.plot(df4.index, df4.values)

However, this gives me a plot like this:
Apr-Jun MDA8 values

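Again, I want to make a separate scatter plot for each month, although this is closer to what I want. Any help would be appreciated, thank you.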

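Edit: I also have data for 2020, which only extends through July. I don't think this will affect my graph, but I wanted to mention it. Ideally, I'd like it to look something like this, but with a different point for each year, for the month of April.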

Scatterplot

Answer 1

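df.index -= pd.DateOffset(hours=3) has been removed because it is potentially problematic:
- The first hours of each month would end up in the previous month
- The first hours of each day would end up in the previous day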
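A small sketch of that effect, using six hypothetical hourly timestamps at the start of May 2011 (the timestamps are illustrative, not taken from the ozone files): after subtracting three hours, the readings for hours 0 to 2 carry an April 30 label.

```python
import pandas as pd

# Hypothetical hourly timestamps for the first six hours of 2011-05-01
idx = pd.Timestamp("2011-05-01") + pd.to_timedelta(list(range(6)), unit="h")

# Same shift as in the question
shifted = idx - pd.DateOffset(hours=3)

for before, after in zip(idx, shifted):
    print(before, "->", after, "| month", before.month, "->", after.month)

# Count how many timestamps changed month because of the shift
moved = int((shifted.month != idx.month).sum())
print("timestamps moved to the previous month:", moved)
```

Any later grouping by day or month would then assign those hours to the wrong group.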

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import date
from pandas.tseries.offsets import MonthEnd

# set the path to the files
p = Path('/PythonProjects/stack_overflow/data/ozone/')

# list of files
files = list(p.glob('OZONE*.csv'))

# create a dataframe from the files - all years all data
df = pd.concat([pd.read_csv(file) for file in files])

# format the dataframe
df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df = df[df.month.isin([4, 5, 6])].copy()  # filter the dataframe - only April, May, June
df.set_index('datetime', inplace=True)
df.sort_index(inplace=True)  # time-based rolling windows require a monotonic index

# calculate the 8-hour rolling mean
df['r_mean'] = df.value.rolling('8h', min_periods=6).mean()

# determine max value per day
r_mean_daily_max = df.groupby(['year', 'month', 'day'], as_index=False)['r_mean'].max()

# calculate the mean from the daily max
mda8 = r_mean_daily_max.groupby(['year', 'month'], as_index=False)['r_mean'].mean()

# add a new datetime column with the date as the end of the month
mda8['datetime'] = pd.to_datetime(mda8.year.astype(str) + mda8.month.astype(str), format='%Y%m') + MonthEnd(1)
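The same three steps (8-hour rolling mean, daily maximum, monthly mean of the daily maxima) can be tried without the CSV files. The following is a minimal sketch on synthetic random data, so the numbers mean nothing. It swaps in `resample('D')` for the daily maximum instead of the `groupby` on year, month and day columns used above, and the names `hourly`, `daily_max` and `monthly` are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

# Synthetic hourly "ozone" for 2011-04-01 through 2011-06-30 (random values, illustration only)
rng = np.random.default_rng(0)
idx = pd.Timestamp("2011-04-01") + pd.to_timedelta(np.arange(91 * 24), unit="h")
hourly = pd.Series(rng.uniform(10, 70, size=len(idx)), index=idx, name="value")

# 8-hour rolling mean; at least 6 observations are required inside each window
r_mean = hourly.rolling("8h", min_periods=6).mean()

# Daily maximum of the rolling mean (NaN values at the start are skipped by max)
daily_max = r_mean.resample("D").max()

# Mean of the daily maxima for each (year, month) pair
monthly = (
    daily_max
    .groupby([daily_max.index.year, daily_max.index.month])
    .mean()
    .rename_axis(["year", "month"])
)
print(monthly)
```

Grouping on the index attributes gives one row per year and month, which is the same shape as the `mda8` dataframe before the month-end `datetime` column is added.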

`df.info()` and `.head()` before any processing

<class 'pandas.core.frame.DataFrame'>
Int64Index: 78204 entries, 0 to 4663
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   site        78204 non-null  int64  
 1   date        78204 non-null  object 
 2   start_hour  78204 non-null  int64  
 3   value       78204 non-null  float64
 4   variable    78204 non-null  object 
 5   units       78204 non-null  object 
 6   quality     4664 non-null   float64
 7   prelim      4664 non-null   object 
 8   name        78204 non-null  object 
dtypes: float64(2), int64(2), object(5)
memory usage: 6.0+ MB

   site        date  start_hour  value variable                      units  quality prelim                   name 
0  3135  2011-01-01           0   14.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street 
1  3135  2011-01-01           1   11.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street 
2  3135  2011-01-01           2   22.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street 
3  3135  2011-01-01           3   25.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street 
4  3135  2011-01-01           5   22.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street

`df.info()` and `.head()` after processing

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 20708 entries, 2011-04-01 00:00:00 to 2020-06-30 23:00:00
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   site      20708 non-null  int64  
 1   value     20708 non-null  float64
 2   variable  20708 non-null  object 
 3   units     20708 non-null  object 
 4   quality   2086 non-null   float64
 5   prelim    2086 non-null   object 
 6   name      20708 non-null  object 
 7   month     20708 non-null  int64  
 8   day       20708 non-null  int64  
 9   year      20708 non-null  int64  
 10  r_mean    20475 non-null  float64
dtypes: float64(3), int64(4), object(4)
memory usage: 1.9+ MB

                     site  value variable                      units  quality prelim                   name   month  day  year  r_mean
datetime                                                                                                                              
2011-04-01 00:00:00  3135   13.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street       4    1  2011     NaN
2011-04-01 01:00:00  3135   29.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street       4    1  2011     NaN
2011-04-01 02:00:00  3135   31.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street       4    1  2011     NaN
2011-04-01 03:00:00  3135   28.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street       4    1  2011     NaN
2011-04-01 05:00:00  3135   11.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street       4    1  2011     NaN

`r_mean_daily_max.info()` and `.head()`

<class 'pandas.core.frame.DataFrame'>
Int64Index: 910 entries, 0 to 909
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   year    910 non-null    int64  
 1   month   910 non-null    int64  
 2   day     910 non-null    int64  
 3   r_mean  910 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 35.5 KB

   year  month  day  r_mean
0  2011      4    1  44.125
1  2011      4    2  43.500
2  2011      4    3  42.000
3  2011      4    4  49.625
4  2011      4    5  45.500

`mda8.info()` and `.head()`

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   year      30 non-null     int64         
 1   month     30 non-null     int64         
 2   r_mean    30 non-null     float64       
 3   datetime  30 non-null     datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 1.2 KB

   year  month     r_mean   datetime
0  2011      4  49.808135 2011-04-30
1  2011      5  55.225806 2011-05-31
2  2011      6  58.162302 2011-06-30
3  2012      4  45.865278 2012-04-30
4  2012      5  61.061828 2012-05-31

mda8

Plot 1

sns.lineplot(x=mda8.datetime, y=mda8.r_mean, marker='o')  # x and y passed as keywords
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))

Plot 2

# create color mapping based on all unique values of year
years = mda8.year.unique()
colors = sns.color_palette('husl', n_colors=len(years))  # get a number of colors
cmap = dict(zip(years, colors))  # zip values to colors

for g, d in mda8.groupby('year'):
    # one line per year, colored from the custom mapping and labeled for the legend
    sns.lineplot(x=d.datetime, y=d.r_mean, marker='o', color=cmap[g], label=g)
    
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)

Plot 3

sns.barplot(x='month', y='r_mean', data=mda8, hue='year')
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plt.title('MDA8: April - June')
plt.ylabel('mda8 (ppb)')
plt.show()

Plot 4

for month in mda8.month.unique():
    data = mda8[mda8.month == month]  # filter and plot the data for a specific month
    plt.figure()  # create a new figure for each month
    sns.lineplot(x=data.datetime, y=data.r_mean, marker='o')
    plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
    plt.title(f'Month: {month}')
    plt.ylabel('MDA8: PPB')
    plt.xlabel('Year')

There will be one plot for each month
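Since the question asks for scatter plots with the year on the x axis, a plain matplotlib variant may also be useful. This is only a sketch: `mda8_demo` is a hypothetical stand-in for the `mda8` dataframe with made-up values, and it places the three months side by side in one figure instead of creating a separate figure per month.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the mda8 dataframe (values are made up for illustration)
mda8_demo = pd.DataFrame({
    "year": [2011, 2011, 2011, 2012, 2012, 2012, 2013, 2013, 2013],
    "month": [4, 5, 6] * 3,
    "r_mean": [49.8, 55.2, 58.2, 45.9, 61.1, 57.0, 50.3, 54.8, 59.6],
})

month_names = {4: "April", 5: "May", 6: "June"}

# One axes per month; groupby yields the months in sorted order (4, 5, 6)
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, (month, data) in zip(axes, mda8_demo.groupby("month")):
    ax.scatter(data["year"], data["r_mean"])
    ax.set_xticks(data["year"])  # one tick per year
    ax.set_title(month_names[month])
    ax.set_xlabel("Year")
axes[0].set_ylabel("MDA8 (ppb)")
fig.tight_layout()
```

In an interactive session, `plt.show()` displays the figure. With the real data, `mda8_demo` would be replaced by `mda8`.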

Plot 5

for month in mda8.month.unique():
    data = mda8[mda8.month == month]
    sns.lineplot(x=data.datetime, y=data.r_mean, marker='o', label=month)
    plt.legend(title='Month')
    plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
    plt.ylabel('MDA8: PPB')
    plt.xlabel('Year')

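Addressing "I want to make 3 separate scatter plots for the months of April, May, and June."
The main issue is that the data can't be plotted with a datetime axis.
- The objective is to plot each day of the month on the axis, with each figure showing a different month.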

Lineplot

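The plot is somewhat busy.
A custom color map is used because the standard palette doesn't have enough colors to give each year a unique color.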

# create color mapping based on all unique values of year
years = df.index.year.unique()
colors = sns.color_palette('husl', n_colors=len(years))  # get a number of colors
cmap = dict(zip(years, colors))  # zip values to colors

for k, v in df.groupby('month'):  # group the dataframe by month
    plt.figure(figsize=(16, 10))
    for year in v.index.year.unique():  # within the month, plot each year
        data = v[v.index.year == year]
        sns.lineplot(x=data.index.day, y=data.r_mean, err_style=None, color=cmap[year], label=year)
    plt.xlim(0, 33)
    plt.xticks(range(1, 32))
    plt.title(f'Month: {k}')
    plt.xlabel('Day of Month')
    plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plt.show()

Here is April; the other two figures look similar to this.

Barplot

for k, v in df.groupby('month'):  # group the dataframe by month
    plt.figure(figsize=(10, 20))

    sns.barplot(x=v.r_mean, y=v.day, ci=None, orient='h', hue=v.index.year)
    plt.title(f'Month: {k}')
    plt.ylabel('Day of Month')
    plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plt.show()
