How do I calculate the mean for each month in a dataset?

Time: 2019-12-18 11:22:20

Tags: python pandas dataframe

Sample wind dataset:

`                 RPT    VAL    ROS    KIL    SHA    BIR    DUB    CLA    MUL    CLO    BEL    MAL
    DATE
    1961-01-04   10.58  6.63   11.75  4.58   4.54   2.88   8.63   1.79   5.83   5.88   5.46   10.88
    1961-01-05   13.33  13.25  11.42  6.17  10.71   8.21   11.92  6.54  10.92  10.34  12.92   11.83
    1961-01-06   13.21  8.12    9.96  6.67   5.37   4.50   10.67  4.42   7.17   7.50   8.12   13.17
    1961-02-07   13.50  14.29   9.50  4.96  12.29   8.33    9.17  9.29   7.58   7.96   13.96  13.79
    1961-02-08   10.96  9.75    7.62  5.91   9.62   7.29   14.29  7.62   9.25  10.46   16.62  16.46
    1961-03-04   10.58  6.63   11.75  4.58   4.54   2.88   8.63   1.79   5.83   5.88   5.46   10.88
    1962-03-05   13.33  13.25  11.42  6.17  10.71   8.21   11.92  6.54  10.92  10.34  12.92   11.83
    1962-06-06   13.21  8.12    9.96  6.67   5.37   4.50   10.67  4.42   7.17   7.50   8.12   13.17
    1968-07-07   13.50  14.29   9.50  4.96  12.29   8.33    9.17  9.29   7.58   7.96   13.96  13.79
    1968-07-08   10.96  9.75    7.62  5.91   9.62   7.29   14.29  7.62   9.25  10.46   16.62  16.46
    1976-08-04   10.58  6.63   11.75  4.58   4.54   2.88   8.63   1.79   5.83   5.88   5.46   10.88
    1976-08-05   13.33  13.25  11.42  6.17  10.71   8.21   11.92  6.54  10.92  10.34  12.92   11.83
    1978-09-06   13.21  8.12    9.96  6.67   5.37   4.50   10.67  4.42   7.17   7.50   8.12   13.17
    1978-09-07   13.50  14.29   9.50  4.96  12.29   8.33    9.17  9.29   7.58   7.96   13.96  13.79
    1978-12-08   10.96  9.75    7.62  5.91   9.62   7.29   14.29  7.62   9.25  10.46   16.62  16.46`  

The full dataset is here

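In this dataset, the columns are locations and the values are wind speeds. I want to calculate the mean wind speed for each month in the dataset, but I want to treat January 1961 and January 1962 as different months. I tried to do this with a for loop. First I created a column named "Month", then assigned its values with the loop shown below: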

`for i in range(len(data.index)):
    if data.index[i].month == 1:
        if data.index[i].year == 1961:
            data['Month'][i] = 'January 61'
        elif data.index[i].year == 1962:
            data['Month'][i] = 'January 62'
        else:
            data['Month'][i] = 'January'
    elif data.index[i].month == 2:
        data['Month'][i] = 'February'
    elif data.index[i].month == 3:
        data['Month'][i] = 'March'
    elif data.index[i].month == 4:
        data['Month'][i] = 'April'
    elif data.index[i].month == 5:
        data['Month'][i] = 'May'
    elif data.index[i].month == 6:
        data['Month'][i] = 'June'
    elif data.index[i].month == 7:
        data['Month'][i] = 'July'
    elif data.index[i].month == 8:
        data['Month'][i] = 'August'
    elif data.index[i].month == 9:
        data['Month'][i] = 'September'
    elif data.index[i].month == 10:
        data['Month'][i] = 'October'
    elif data.index[i].month == 11:
        data['Month'][i] = 'November'
    elif data.index[i].month == 12:
        data['Month'][i] = 'December'`  

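Then I would group by data['Month'] and take the mean. But the loop takes a very long time to run, and I don't want to wait that long every time I run the program. How else can I solve this?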

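Note: the actual dataset is not quite the same as the sample dataset. I merged the columns ['Yr','Mo','Dy'] into a single column named "DATE" and then set "DATE" as the index. I also dropped all NaN values with data.dropna(inplace=True).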

2 answers:

Answer 0: (score: 1)

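Try grouping by both the year and the month of the index, so that the same month in different years stays separate: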

df.index = pd.to_datetime(df.index)
df.groupby([df.index.year, df.index.month]).mean()

             RPT        VAL        ROS  ...        CLO        BEL     MAL
DATE DATE                                   ...                              
1961 1     12.373333   9.333333  11.043333  ...   7.906667   8.833333  11.960
     2     12.230000  12.020000   8.560000  ...   9.210000  15.290000  15.125
     3     10.580000   6.630000  11.750000  ...   5.880000   5.460000  10.880
1962 3     13.330000  13.250000  11.420000  ...  10.340000  12.920000  11.830
     6     13.210000   8.120000   9.960000  ...   7.500000   8.120000  13.170
1968 7     12.230000  12.020000   8.560000  ...   9.210000  15.290000  15.125
1976 8     11.955000   9.940000  11.585000  ...   8.110000   9.190000  11.355
1978 9     13.355000  11.205000   9.730000  ...   7.730000  11.040000  13.480
     12    10.960000   9.750000   7.620000  ...  10.460000  16.620000  16.460
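
A possible variation on the same idea, shown here only as a sketch: grouping by a monthly period gives a single-level index such as 1961-01 instead of a (year, month) pair. The small DataFrame below is a hypothetical sample. Its first five rows are copied from the question's RPT and VAL columns, and the 1962-01 row is invented purely to show that January 1961 and January 1962 end up in different groups.

```python
import pandas as pd

# Hypothetical sample: first five rows come from the question's data,
# the 1962-01-08 row is made up to illustrate the year separation.
df = pd.DataFrame(
    {
        "RPT": [10.58, 13.33, 13.21, 13.50, 10.96, 9.00],
        "VAL": [6.63, 13.25, 8.12, 14.29, 9.75, 7.00],
    },
    index=pd.to_datetime(
        ["1961-01-04", "1961-01-05", "1961-01-06",
         "1961-02-07", "1961-02-08", "1962-01-08"]
    ),
)
df.index.name = "DATE"

# Convert each timestamp to its calendar month (year included) and group on that.
monthly = df.groupby(df.index.to_period("M")).mean()
print(monthly)
```

Because the period carries the year, no hand-written labels like 'January 61' are needed, and no Python-level loop runs over the rows.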

Answer 1: (score: 0)

I think the groupby approach you tried is the way to go:

df.groupby(['year','month'])['RPT'].mean().reset_index()
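
The line above presumes that 'year' and 'month' columns already exist. If only the "DATE" index is available, one way they might be derived is sketched below. The sample values are the first five RPT readings from the question plus one invented 1962-01 row, and the name with_keys is just an illustrative choice.

```python
import pandas as pd

# Hypothetical sample: five RPT values from the question plus one made-up 1962 row.
df = pd.DataFrame(
    {"RPT": [10.58, 13.33, 13.21, 13.50, 10.96, 9.00]},
    index=pd.to_datetime(
        ["1961-01-04", "1961-01-05", "1961-01-06",
         "1961-02-07", "1961-02-08", "1962-01-08"]
    ),
)
df.index.name = "DATE"

# Add the grouping keys as ordinary columns, computed from the datetime index.
with_keys = df.assign(year=df.index.year, month=df.index.month)

# Same call as in the answer: mean RPT per (year, month), flattened back to columns.
result = with_keys.groupby(["year", "month"])["RPT"].mean().reset_index()
print(result)
```

Dropping the ['RPT'] selection would average every location column instead of just one.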