对齐多年的每日数据

时间:2016-01-22 16:08:21

标签: python pandas time-series

从当天不同时间测量的多年温度记录开始,我想得到一个矩形的日平均值数组,每行代表一年的数据。

数据看起来像这样

temperature.head()

date
1996-01-01 00:00:00     7.39
1996-01-01 03:00:00     6.60
1996-01-01 06:00:00     7.39
1996-01-01 09:00:00     9.50
1996-01-01 12:00:00    11.00
Name: temperature, dtype: float64

我用

计算每日平均值
import pandas as pd
daily = temperature.groupby(pd.TimeGrouper(freq='D')).mean()

哪个收益

daily.head()

date
1996-01-01     9.89625
1996-01-02    10.73625
1996-01-03     6.98500
1996-01-04     5.62250
1996-01-05     8.84625
Freq: D, Name: temperature, dtype: float64

现在最后一部分我想到的是

yearly_daily_mean = daily.groupby(pd.TimeGrouper(freq='12M', closed="left"))

但这里有一些问题。

  1. 我需要放弃数据的尾部而不是一整年。
  2. 如果数据丢失会怎样?
  3. 如何处理闰年?
  4. 下一步是什么?也就是说,如何“堆叠”(在numpy' s,而不是pandas'意义上)数据的年代?
  5. 我正在使用

    array_temperature = np.column_stack([group[1] for group in yearly_daily_mean if len(group[1]) == 365])
    

    但应该有更好的方法。

    作为附属问题,我如何选择数据年份的开始日期?

3 个答案:

答案 0 :(得分:4)

如果我理解正确,您希望将日常工作的时间序列(已经计算过的)重新整形为矩形数据框,其中不同的日期为列,不同的年份为行。
这可以通过pandas重塑功能轻松实现,例如使用pivot

一些虚拟数据:

In [45]: index = pd.date_range(start=date(1996, 1,1), end=date(2010, 6, 30), freq='D')

In [46]: daily = pd.DataFrame(index=index, data=np.random.random(size=len(index)), columns=['temperature'])

首先,我添加一年中和每年的列:

In [47]: daily['year'] = daily.index.year

In [48]: daily['day'] = daily.index.dayofyear

In [49]: daily.head()
Out[49]:
            temperature  year  day
1996-01-01     0.081774  1996    1
1996-01-02     0.694968  1996    2
1996-01-03     0.478050  1996    3
1996-01-04     0.123844  1996    4
1996-01-05     0.426150  1996    5

现在,我们可以重塑这个数据框:

In [50]: daily.pivot(index='year', columns='day', values='temperature')
Out[50]:
day        1         2      ...          365       366
year                        ...
1996  0.081774  0.694968    ...     0.679461  0.700833
1997  0.043134  0.981707    ...     0.009357       NaN
1998  0.257077  0.297290    ...     0.701941       NaN
...        ...       ...    ...          ...       ...
2008  0.047145  0.750354    ...     0.996396  0.761159
2009  0.348667  0.827057    ...     0.881424       NaN
2010  0.269743  0.872655    ...          NaN       NaN

[15 rows x 366 columns]

答案 1 :(得分:1)

假设您已经拥有来自更高频率数据的每日平均值(pd.DateTimeIndex):

daily = temperature.groupby(pd.TimeGrouper(freq='D')).mean()

IIUC,您希望将每日平均值转换为每行具有相同列数的DataFrame以捕获年度数据。当瞄准相同数量的列时,你提到闰年是一个潜在的问题。

我可以想象两种方法:

  1. 选择每row天数 - 可能为365天。为每个row选择365个连续每日数据点的滚动块,并为每个块对齐index
  2. 选择多年的数据,填补闰年的空白,并按MM-DD或一年中的天数对齐。
  3. 从每日随机data的20年半开始作为模拟日平均温度:

    index = pd.date_range(start=date(1995, 1,1), end=date(2015, 6, 30), freq='D')
    df = pd.DataFrame(index=index, data=np.random.random(size=len(index)) * 30, columns=['temperature'])
    
    df.info()
    
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 7486 entries, 1995-01-01 to 2015-06-30
    Freq: D
    Data columns (total 1 columns):
    temperature    7486 non-null float64
    dtypes: float64(1)
    memory usage: 117.0 KB
    None
    
    df.head()
    
                temperature
    1995-01-01     4.119212
    1995-01-02    27.107131
    1995-01-03    26.704931
    1995-01-04     7.430203
    1995-01-05     4.230398
    
    df.tail()
    
                temperature
    2015-06-26    10.902779
    2015-06-27     8.494378
    2015-06-28    17.800131
    2015-06-29    19.543815
    2015-06-30    16.390435
    

    这是第一种方法的解决方案:

    使用.groupby(pd.TimeGrouper('365D'))选择连续365天的块,并将每个平均每个平均值的groupby对象返回pd.DataFrame,其中integer index运行每个序列从0364

    aligned = df.groupby(pd.TimeGrouper(freq='365D')).apply(lambda x: pd.DataFrame(x.squeeze().tolist())) # .squeeze() converts single columns `DataFrame` to pd.Series
    

    要对齐21个数据块,只需转置pd.DataFrame,它们将在integer索引columns, with the start date of each sequence in the索引. This operation will produce an extra中按, and the last索引对齐row`将有一些缺失的数据。用以下方法清理两者:

    aligned.dropna().reset_index(-1, drop=True)
    

    获取[20 x 365] DataFrame,如下所示:

    DatetimeIndex: 20 entries, 1995-01-01 to 2013-12-27
    Freq: 365D
    Columns: 365 entries, 0 to 364
    dtypes: float64(365)
    memory usage: 57.2 KB
    
                      0          1          2          3          4          5    \
    1995-01-01  29.456090  25.313968   4.146206   5.347690  25.767425  11.978152   
    1996-01-01  25.585481  26.846486   8.336905  16.749842   6.247542  17.723733   
    1996-12-31  23.410462  10.168599   5.601917  11.996500   8.650726  23.362815   
    1997-12-31   7.586873  23.882106  22.145595   3.287160  21.642547   1.949321   
    1998-12-31  14.691420   3.611475  28.287327  25.347787  13.291708  20.571616   
    1999-12-31  25.713866  17.588570  18.562117  19.420944  12.406293  11.870750   
    2000-12-30   5.099561  17.894763  21.168223   4.786461  24.521417  21.443607   
    2001-12-30  11.791223   8.352493  12.731769   0.459697  20.680396  27.554783   
    2002-12-30   3.785876   0.359850  20.828764  15.376991  14.086626   0.477615   
    2003-12-30  23.633243  12.726250   8.197824  16.355956   8.094145   1.410746   
    2004-12-29   1.139949   4.161267   9.043062  14.109888  13.538735   1.566002   
    2005-12-29  25.504224  19.346419   3.300641  26.933084  23.634321  18.323450   
    2006-12-29  10.535785   9.168498  27.222106  11.962343  10.004678  23.893257   
    2007-12-29  27.482856   6.910670   6.033291  12.673530  26.362971   4.492178   
    2008-12-28  11.152316  25.233664  22.124299  11.012285   1.992814  25.542204   
    2009-12-28  23.131021  16.363467   1.242393  10.387653   4.858851  26.553950   
    2010-12-28  13.134843   9.195658  19.075850  28.539387   3.075934   8.089347   
    2011-12-28  28.860275  10.121573   0.663906  19.687892  29.376377  11.488446   
    2012-12-27   7.644073  19.649330  25.497595   6.592940   8.879444  17.733670   
    2013-12-27  11.713996   2.602284   3.835302  22.244623  27.279810  14.144943   
    
                      6          7          8          9      ...            355  \
    1995-01-01   8.210005   8.129146  28.798472  25.646924    ...      24.177163   
    1996-01-01   0.481487  16.772357   3.934185  22.640157    ...      23.340931   
    1996-12-31  10.813812  16.276504   3.422665  14.916229    ...      13.817015   
    1997-12-31  19.184753  28.628326  22.134871  12.721064    ...      23.905483   
    1998-12-31   2.839492   7.889141  17.951959  25.233585    ...      28.002751   
    1999-12-31   6.958672  26.335427  23.361470   5.911806    ...       7.778412   
    2000-12-30   8.405042  25.229016  19.746462  15.332004    ...       5.703830   
    2001-12-30   0.558788  15.457327  20.987186  25.452723    ...      29.771372   
    2002-12-30  19.002685  26.455754  25.468178  25.383786    ...      14.238987   
    2003-12-30  22.984328  15.934398  25.361599  12.221306    ...       1.189949   
    2004-12-29  22.121901  21.421103  26.175702  16.040881    ...      19.945408   
    2005-12-29   2.557901  15.193412  27.049389   4.825570    ...       7.629859   
    2006-12-29   8.582602  26.037375   0.933591  13.469771    ...      29.453932   
    2007-12-29  29.437921  26.470153   9.917871  16.875801    ...       5.702116   
    2008-12-28   3.809633  10.583385  18.029571   0.440077    ...      11.337894   
    2009-12-28  24.406696  28.294553  19.929563   4.683991    ...      25.697446   
    2010-12-28  29.765551  16.716723   6.467946  10.998447    ...      26.988863   
    2011-12-28  28.962746  11.407137   9.957111   4.502521    ...      14.606937   
    2012-12-27   1.374502   5.571244  11.212960   9.949830    ...      23.345868   
    2013-12-27  26.373866   4.781510  16.828510  10.280078    ...       0.552726   
    
                      356        357        358        359        360        361  \
    1995-01-01  13.511951  10.126835  28.121730  23.275360  11.785242  27.907039   
    1996-01-01  13.362737  14.336780  24.114908  28.479688   8.509069  17.408937   
    1996-12-31  19.192674   1.146844  27.499688   7.090407   2.777819  22.826814   
    1997-12-31  21.502186  10.495148  21.786895  12.229181   8.068271   6.522108   
    1998-12-31  21.338355  11.978265   9.186161  21.053924   3.033370  29.934703   
    1999-12-31   5.960120  20.325684   0.915052  15.059979  12.194240  20.138567   
    2000-12-30  11.883186   2.764768  27.324304  29.630706  21.852058  20.416199   
    2001-12-30   7.802891  25.384479   9.044486   8.809446   7.606603   6.051890   
    2002-12-30   7.362494   8.940783   5.259984   7.035818  24.094134   7.197113   
    2003-12-30  25.596902   9.756372   6.345198   1.520188  22.752717   3.470268   
    2004-12-29  26.789064   9.708466  18.287838  21.134643  29.862135  19.926086   
    2005-12-29  26.398394  24.717514  16.606042  28.189245  24.574806  14.297410   
    2006-12-29   8.795342  18.019536  16.579878  20.368811  22.052442  26.393676   
    2007-12-29   8.696240  25.901889  16.410934  15.274897  14.365867  10.523388   
    2008-12-28  18.581513  25.974784  21.025297  10.521118   5.864974   2.373023   
    2009-12-28  14.437944  21.717456   4.017870  14.024522   0.959989  17.215403   
    2010-12-28  11.426540  13.751451   4.664761  15.373878   7.731613   7.269089   
    2011-12-28   1.952897   9.406866  28.957258  20.239517  11.156958  29.238761   
    2012-12-27   7.588643  21.186675  17.348911   1.354323  13.918083   3.034123   
    2013-12-27  22.916065   2.089675  22.832061  14.787841  25.697875  14.087893   
    
                      362        363        364  
    1995-01-01  13.107523  10.740551  20.511825  
    1996-01-01  25.016219  17.885332   2.438875  
    1996-12-31  24.692327   0.221760   6.749919  
    1997-12-31  24.856169   0.930019  22.603652  
    1998-12-31  18.361414  13.587695  25.161495  
    1999-12-31   0.512120  26.482288   1.035197  
    2000-12-30  15.401012  28.334219   5.965014  
    2001-12-30  10.292213  10.951915   8.270319  
    2002-12-30  21.945734  27.076438   6.795688  
    2003-12-30  14.788929  19.456459  11.216835  
    2004-12-29   7.086443  25.463503  17.549196  
    2005-12-29  12.252487  29.081547  25.507369  
    2006-12-29   0.012617   0.086186  17.421958  
    2007-12-29   4.191633  21.588891   7.516187  
    2008-12-28  26.194288  20.500256  24.876032  
    2009-12-28  28.445254  27.338754   7.849899  
    2010-12-28  28.888573  26.801262  23.117027  
    2011-12-28  19.871547  20.324514  18.369134  
    2012-12-27  15.907752   9.417700   4.922940  
    2013-12-27  21.132385  20.707216   5.288128  
    
    [20 rows x 365 columns]
    

    如果您想简单地收集数据的年份并按日期对齐,那么非闰年的缺失日期大约为60(而不是366),您可以:

    df.groupby(pd.TimeGrouper(freq='A')).apply(lambda x: pd.DataFrame(x.squeeze().tolist()).T).reset_index(-1, drop=True).iloc
    
                      0          1          2          3          4          5    \
    1995-12-31   1.245796  28.487530   0.574299  10.033485  19.221512   8.718728   
    1996-12-31  12.258653   3.864652  25.237088  13.982809  24.494746  13.822292   
    1997-12-31  22.239412   4.796824  21.389404  11.151171  25.577368   1.754948   
    1998-12-31  24.968287   2.089894  25.888487  28.291714  19.115844  24.426285   
    1999-12-31   9.285363  19.339405  26.012193   3.243394  25.176499   8.766770   
    2000-12-31  26.996573  26.404391   1.793644  21.314488  13.118279  26.703532   
    2001-12-31  16.303829  14.021771  20.828238  11.427195   3.099290  18.730795   
    2002-12-31  14.614617  10.694258   5.226033  24.900849  17.395822  22.154202   
    2003-12-31  10.564132   8.267639   7.778573  26.704936   5.671499   0.470963   
    2004-12-31  22.649623  15.725867  18.445629   7.529507  11.868134  10.965534   
    2005-12-31   2.406615   9.709624  23.284616  11.479254  23.814725   1.656826   
    2006-12-31  19.164459  23.177769  16.091672  28.936777  28.636072   4.838555   
    2007-12-31  12.371377   3.417582  21.067689  25.493921  25.410295  15.526614   
    2008-12-31  29.080385   4.653984  16.567333  24.248921  27.338538   9.353291   
    2009-12-31  29.608734   6.046593  22.738628  22.631714  26.061903  21.217846   
    2010-12-31  27.458254  15.146497  18.917073   8.473955  26.782767  10.891648   
    2011-12-31  25.433759   8.959650  14.343507  16.249726  17.031174  12.944418   
    2012-12-31  22.940797   4.791280  11.765939  25.925645   3.649440  27.483407   
    2013-12-31  11.684391  27.701678  27.423083  27.656086   9.374896  14.250936   
    2014-12-31  23.660098  27.768960  25.753294   3.014606  23.330226  17.570492   
    
                      6          7          8          9      ...            356  \
    1995-12-31  17.079137  26.100763  12.376462  12.315219    ...      16.910185   
    1996-12-31  26.718277  10.349412  12.940624   9.453769    ...      19.235435   
    1997-12-31  20.201528  22.895552   1.443243  20.584140    ...      29.665815   
    1998-12-31  21.493163  16.724328   5.946833  15.230762    ...       2.617883   
    1999-12-31   9.776013  13.381424  11.028295   1.905501    ...       7.200409   
    2000-12-31   9.773097  14.565345  22.578398   0.688273    ...      18.119020   
    2001-12-31   1.095308  14.817514  25.652418   8.327481    ...      15.385689   
    2002-12-31  29.744794  15.545211   6.373948  13.451261    ...       7.446414   
    2003-12-31  14.971959  25.948332  21.596976   5.355589    ...      23.676867   
    2004-12-31   0.604113   2.858745   0.120340  19.365223    ...       0.336213   
    2005-12-31   6.260722   9.819337  19.573953  11.132919    ...      26.107100   
    2006-12-31  10.341241  15.126506   3.349634  23.619127    ...      15.508680   
    2007-12-31  20.033540  22.103483   7.674852   1.263726    ...      15.148461   
    2008-12-31  28.233973  27.982105  17.037928   5.389418    ...       8.773618   
    2009-12-31   4.400039   7.284556  11.825382   4.201001    ...       6.734423   
    2010-12-31  26.086305  26.275027   8.069376  19.200344    ...      19.056528   
    2011-12-31  29.215028   0.985623   4.813478   7.752540    ...      14.395423   
    2012-12-31   4.690336   9.618306  25.492041  10.400292    ...       8.853903   
    2013-12-31   8.227096  11.013431   0.996911  15.276574    ...      26.227540   
    2014-12-31  23.440591  16.544698   2.263684   3.919315    ...      24.987387   
    
                      357        358        359        360        361        362  \
    1995-12-31  24.791125  21.443534  21.092439   8.289222   9.745293  20.084046   
    1996-12-31   2.632656   2.102163  24.828437  18.104255   7.951859   3.266873   
    1997-12-31  11.246534  14.086539  29.635519  19.518642  24.086108   6.041870   
    1998-12-31  29.961162   9.924863   9.401790  25.597344  13.885467  16.537406   
    1999-12-31   3.057125  15.241720   8.472388   3.248545  11.302522  19.283612   
    2000-12-31  22.999729  17.518504  10.058249   2.953903  10.167712  17.309525   
    2001-12-31  18.267445  23.205300  25.658591  19.915797  10.704525  26.604965   
    2002-12-31  11.497110   3.641206   9.693428  24.571510   6.438652  29.280098   
    2003-12-31  23.931401  19.967615   0.307896   0.385782   0.579257   7.534806   
    2004-12-31  21.321146   9.224362   1.703842   6.180944  28.173925   5.178336   
    2005-12-31  17.990409  28.746179   2.524899  10.555224  25.487723  19.877390   
    2006-12-31   9.748760  29.069966   1.717175   3.283069   9.615215  25.787787   
    2007-12-31  29.772930  20.892030  16.597493  20.079373  17.320327   9.583089   
    2008-12-31  22.787891  26.636413  13.872783  29.305847  21.287553   1.263788   
    2009-12-31   1.574188  23.172773   0.967153   1.928999  12.201354   0.125939   
    2010-12-31  20.566125   0.429552   4.413156  16.106451  27.745684  18.280928   
    2011-12-31   9.348584   2.604338  23.397221   7.378340  16.757224  29.364973   
    2012-12-31   4.704570   7.278321  19.034622  24.597784  13.694635  15.912901   
    2013-12-31  21.657446  14.110146  23.976991   8.203509  20.083490   4.471119   
    2014-12-31  14.465823   9.105391  15.984162   6.796756   8.232619  18.761280   
    
                      363        364        365  
    1995-12-31  28.165022   9.735041        NaN  
    1996-12-31  11.644543   4.139818   5.420238  
    1997-12-31   2.500165  18.290531        NaN  
    1998-12-31  23.856333  10.064951        NaN  
    1999-12-31   3.090008  26.203395        NaN  
    2000-12-31  22.216599  27.942821   0.791318  
    2001-12-31  25.682003   4.766435        NaN  
    2002-12-31  19.785159  28.972659        NaN  
    2003-12-31  15.692168  21.388069        NaN  
    2004-12-31   9.079675   7.392328  12.583179  
    2005-12-31  18.202333  21.895494        NaN  
    2006-12-31  20.951937  26.220226        NaN  
    2007-12-31  23.603166  28.165377        NaN  
    2008-12-31  20.532933   9.401494  25.296916  
    2009-12-31   5.879644  10.377044        NaN  
    2010-12-31   0.436284  20.875852        NaN  
    2011-12-31  13.205290   6.832805        NaN  
    2012-12-31  23.253155  17.760731  23.270751  
    2013-12-31  19.807798   2.453238        NaN  
    2014-12-31  12.817601  11.756561        NaN  
    
    [20 rows x 366 columns]
    

答案 2 :(得分:1)

我将如何做到这一点。非常简单:用你想要的确切形状创建一个新的df,然后用你想要的东西填充它。

from datetime import datetime
import numpy as np
import pandas as pd

# This is my re-creation of the data you have. (I'm calling it df1.)
# It's essential that your date-time be in datetime.datetime format, not strings
byear = 1996  # arbitrary
eyear = 2005  # arbitrary
obs_n = 50000 # arbitrary
start_time = datetime.timestamp(datetime(byear,1,1,0,0,0,0))
end_time = datetime.timestamp(datetime(eyear,12,31,23,59,59,999999))
obs_times = np.linspace(start_time,end_time,num=obs_n)
index1 = pd.Index([datetime.fromtimestamp(i) for i in obs_times])
df1 = pd.DataFrame(data=np.random.rand(obs_n)*20,index=index1,columns=['temp'])
# ^some random data

# Here is the new empty dataframe (df2) where you will put your daily averages. 
index2 = pd.Index(range(byear,eyear+1))
columns2 = range(1,367) # change to 366 if you want to assume 365-day years
df2 = pd.DataFrame(index=index2,columns=columns2)

# Some quick manipulations that allow the two dfs' indexes to talk to one another.
df1['year'] = df1.index.year # a new column with the observation's year as an integer
df1['day'] = df1.index.dayofyear # a new column with the day of the year as integer
df1 = df1.reset_index().set_index(['year','day'])

# Now get the averages for each day and assign them to df2.
for year in index2:
    for day in columns2[:365]: # for all but the last entry in the range
        df2.loc[year,day] = df1.loc[(year,day),'temp'].mean()
    if (year,366) in df1.index: # then if it's a leap year...
        df2.loc[year,366] = df1.loc[(year,366),'temp'].mean()

如果你不希望最后的df在第366天有任何空值,那么你可以删除最后的if语句,并重写columns2 = range(1,366),然后df2将全部为非 - 空值(假设在观察到的时间段内每天至少有一次测量)。