New DataFrame with the daily average values

Asked: 2017-12-15 15:42:06

Tags: python pandas dataframe average

    Site    Parameter   Date (LST)  Year    Month   Day Hour    Value   Unit    Duration    QC Name
1   Beijing PM2.5   2017-01-01 00:00:00 2017    1   1   0   505 µg/m_   1 Hr    Valid
2   Beijing PM2.5   2017-01-01 01:00:00 2017    1   1   1   485 µg/m_   1 Hr    Valid
3   Beijing PM2.5   2017-01-01 02:00:00 2017    1   1   2   466 µg/m_   1 Hr    Valid
4   Beijing PM2.5   2017-01-01 03:00:00 2017    1   1   3   435 µg/m_   1 Hr    Valid
5   Beijing PM2.5   2017-01-01 04:00:00 2017    1   1   4   405 µg/m_   1 Hr    Valid
6   Beijing PM2.5   2017-01-01 05:00:00 2017    1   1   5   402 µg/m_   1 Hr    Valid
7   Beijing PM2.5   2017-01-01 06:00:00 2017    1   1   6   407 µg/m_   1 Hr    Valid
8   Beijing PM2.5   2017-01-01 07:00:00 2017    1   1   7   435 µg/m_   1 Hr    Valid
9   Beijing PM2.5   2017-01-01 08:00:00 2017    1   1   8   472 µg/m_   1 Hr    Valid
10  Beijing PM2.5   2017-01-01 09:00:00 2017    1   1   9   465 µg/m_   1 Hr    Valid
11  Beijing PM2.5   2017-01-01 10:00:00 2017    1   1   10  473 µg/m_   1 Hr    Valid
12  Beijing PM2.5   2017-01-01 11:00:00 2017    1   1   11  456 µg/m_   1 Hr    Valid
13  Beijing PM2.5   2017-01-01 12:00:00 2017    1   1   12  474 µg/m_   1 Hr    Valid
14  Beijing PM2.5   2017-01-01 13:00:00 2017    1   1   13  510 µg/m_   1 Hr    Valid
15  Beijing PM2.5   2017-01-01 14:00:00 2017    1   1   14  596 µg/m_   1 Hr    Valid
16  Beijing PM2.5   2017-01-01 15:00:00 2017    1   1   15  580 µg/m_   1 Hr    Valid
17  Beijing PM2.5   2017-01-01 16:00:00 2017    1   1   16  556 µg/m_   1 Hr    Valid
18  Beijing PM2.5   2017-01-01 17:00:00 2017    1   1   17  522 µg/m_   1 Hr    Valid
19  Beijing PM2.5   2017-01-01 18:00:00 2017    1   1   18  495 µg/m_   1 Hr    Valid
20  Beijing PM2.5   2017-01-01 19:00:00 2017    1   1   19  500 µg/m_   1 Hr    Valid
21  Beijing PM2.5   2017-01-01 20:00:00 2017    1   1   20  484 µg/m_   1 Hr    Valid
22  Beijing PM2.5   2017-01-01 21:00:00 2017    1   1   21  452 µg/m_   1 Hr    Valid
23  Beijing PM2.5   2017-01-01 22:00:00 2017    1   1   22  427 µg/m_   1 Hr    Valid
24  Beijing PM2.5   2017-01-01 23:00:00 2017    1   1   23  444 µg/m_   1 Hr    Valid
25  Beijing PM2.5   2017-01-02 00:00:00 2017    1   2   0   428 µg/m_   1 Hr    Valid
26  Beijing PM2.5   2017-01-02 01:00:00 2017    1   2   1   466 µg/m_   1 Hr    Valid
27  Beijing PM2.5   2017-01-02 02:00:00 2017    1   2   2   452 µg/m_   1 Hr    Valid
28  Beijing PM2.5   2017-01-02 03:00:00 2017    1   2   3   442 µg/m_   1 Hr    Valid
29  Beijing PM2.5   2017-01-02 04:00:00 2017    1   2   4   390 µg/m_   1 Hr    Valid
30  Beijing PM2.5   2017-01-02 05:00:00 2017    1   2   5   317 µg/m_   1 Hr    Valid

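How can I create a new DataFrame from the (truncated) one shown above, keeping all of the columns shown, but with the average value for each day instead of the hourly values?

3 answers: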

Answer 0 (score: 1)

You can try:

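import pandas as pd

# Make sure the timestamp column has a datetime dtype before using .dt
df['Date (LST)'] = pd.to_datetime(df['Date (LST)'])
df['Dates'] = df['Date (LST)'].dt.date

# Average the hourly 'Value' readings within each calendar day
df['daily_average'] = df.groupby(['Dates'])['Value'].transform('mean')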
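Note that `transform` keeps one row per hour and repeats the daily mean on each of them. If one row per day is wanted instead, a minimal sketch, assuming the timestamp column is a datetime, is to resample by calendar day. The small frame below is only an illustration built from four rows of the question's data.

```python
import pandas as pd

# Small illustrative frame built from a few rows of the question's data
df = pd.DataFrame({
    "Date (LST)": pd.to_datetime([
        "2017-01-01 00:00:00", "2017-01-01 01:00:00",
        "2017-01-02 00:00:00", "2017-01-02 01:00:00",
    ]),
    "Value": [505, 485, 428, 466],
})

# Resample to calendar days and average the hourly readings in each day
daily = (
    df.set_index("Date (LST)")["Value"]
      .resample("D")
      .mean()
      .reset_index()
)
print(daily)
```

The result has one row per day, with the date in `Date (LST)` and the mean in `Value`.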

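Answer 1 (score: 1)

This is a very basic split-apply-combine problem. However, since this is environmental data, there are a few nuances I can help you with.

Presumably your full dataset has multiple parameters measured at multiple sites, so you will need to group by those as well. Since your dates are already parsed into their components, we can use those to get the daily values.

Speaking as someone who works with this kind of environmental data every day: you will also always want to group by units. While the units are consistent in this dataset, you will eventually come across a dataset with inconsistent units. Getting into the habit of including units in your groups can help you catch those errors.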
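As a rough illustration of the unit point, the sketch below counts the distinct units reported for each site and parameter. The `readings` frame, its values, and the unit strings are all made up for this example and are not from the question's data.

```python
import pandas as pd

# Hypothetical readings: the last Shanghai row is reported in a different unit
readings = pd.DataFrame({
    "Site": ["Beijing", "Beijing", "Shanghai", "Shanghai"],
    "Parameter": ["PM2.5"] * 4,
    "Unit": ["µg/m3", "µg/m3", "µg/m3", "mg/m3"],
    "Value": [505.0, 485.0, 60.0, 0.055],
})

# Count distinct units per site/parameter; more than one signals a problem
unit_counts = readings.groupby(["Site", "Parameter"])["Unit"].nunique()
mixed = unit_counts[unit_counts > 1]
print(mixed)
```

Any site and parameter pair that shows up in `mixed` would have been averaged across incompatible units if `Unit` had been left out of the grouping.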

Let's read in your data:

from io import StringIO
import pandas

datafile = StringIO("""\
Site    Parameter   "Date (LST)"  Year    Month   Day Hour    Value   Unit    Duration    QC Name
Beijing PM2.5   "2017-01-01 00:00:00" 2017    1   1   0   505 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 01:00:00" 2017    1   1   1   485 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 02:00:00" 2017    1   1   2   466 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 03:00:00" 2017    1   1   3   435 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 04:00:00" 2017    1   1   4   405 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 05:00:00" 2017    1   1   5   402 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 06:00:00" 2017    1   1   6   407 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 07:00:00" 2017    1   1   7   435 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 08:00:00" 2017    1   1   8   472 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 09:00:00" 2017    1   1   9   465 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 10:00:00" 2017    1   1   10  473 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 11:00:00" 2017    1   1   11  456 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 12:00:00" 2017    1   1   12  474 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 13:00:00" 2017    1   1   13  510 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 14:00:00" 2017    1   1   14  596 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 15:00:00" 2017    1   1   15  580 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 16:00:00" 2017    1   1   16  556 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 17:00:00" 2017    1   1   17  522 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 18:00:00" 2017    1   1   18  495 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 19:00:00" 2017    1   1   19  500 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 20:00:00" 2017    1   1   20  484 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 21:00:00" 2017    1   1   21  452 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 22:00:00" 2017    1   1   22  427 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-01 23:00:00" 2017    1   1   23  444 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-02 00:00:00" 2017    1   2   0   428 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-02 01:00:00" 2017    1   2   1   466 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-02 02:00:00" 2017    1   2   2   452 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-02 03:00:00" 2017    1   2   3   442 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-02 04:00:00" 2017    1   2   4   390 µg/m_   1 Hr    Valid
Beijing PM2.5   "2017-01-02 05:00:00" 2017    1   2   5   317 µg/m_   1 Hr    Valid
""")

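# Whitespace splitting also breaks "1 Hr" and "QC Name" into two fields each;
# header and rows stay aligned, and only the columns used below matter here.
df = pandas.read_csv(datafile, sep=r'\s+', parse_dates=['Date (LST)'])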

Then group by all of the columns that define a site-parameter-unit-day combination, select the "Value" column, and take the mean.

group_cols = ['Site', 'Parameter', 'Unit', 'Year', 'Month', 'Day']
df.groupby(by=group_cols)['Value'].mean()

Which gives:

Site     Parameter  Unit   Year  Month  Day
Beijing  PM2.5      µg/m_  2017  1      1      476.916667
                                        2      415.833333

Including the site, parameter, and units in the group-by statement means that the simple statement above scales to datasets with any number of sites and parameters.
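The grouped result above is a Series with a MultiIndex. Since the question asks for a new DataFrame, one possible variation is to pass `as_index=False` so the grouping keys come back as ordinary columns. The sketch below uses a hypothetical two-site frame named `hourly` with made-up values.

```python
import pandas as pd

# Hypothetical hourly readings for two sites (values are made up)
hourly = pd.DataFrame({
    "Site": ["Beijing", "Beijing", "Shanghai", "Shanghai"],
    "Parameter": ["PM2.5"] * 4,
    "Unit": ["µg/m3"] * 4,
    "Year": [2017] * 4,
    "Month": [1] * 4,
    "Day": [1] * 4,
    "Value": [505, 485, 60, 80],
})

group_cols = ["Site", "Parameter", "Unit", "Year", "Month", "Day"]

# as_index=False keeps the grouping keys as regular columns in the result
daily = hourly.groupby(by=group_cols, as_index=False)["Value"].mean()
print(daily)
```

Calling `.reset_index()` on the Series from the original statement should give an equivalent flat table.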

Answer 2 (score: -1)

I believe you are looking for pandas.DataFrame.mean().

Example usage:

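import pandas as pd

# df["Value"].mean() averages every row in df, so df should hold a single day's readings.
# The row is wrapped in an outer list so pandas treats it as one record with ten fields.
data = [["Beijing", "PM2.5", "2017-01-01", 2017, 1, 1, df["Value"].mean(), 'ug/m_', '1 Day', 'Valid']]

averages = pd.DataFrame(data, columns=["Site", "Parameter", "Date", "Year", "Month", "Day", "Value", "Unit", "Duration", "QC Name"])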

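Keep in mind that I hard-coded the values based on how you are getting the information; there may be a better way to bring in the headers and values. But this should show how to use df.mean().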