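Pandas.resample to a non-integer multiple of the frequency

Time: 2014-10-27 18:36:22

Tags: python numpy pandas resampling

I have to resample my dataset from a 10-minute interval to a 15-minute interval to bring it in sync with another dataset. Based on my searches on Stack Overflow I have some ideas on how to proceed, but none of them gives a clean and clear solution.

Question

Problem setup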

#%% Import modules 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#%% make timestamps
periods = 12
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)


#%% Make DataFrame and fill it with some data
df = pd.DataFrame(index=timestamp10min)
y = -(np.arange(periods)-periods/2)**2
df['y'] = y 

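Desired output

Now I want the values that already fall on a 15-minute mark (**:00 and **:30) to stay unchanged, and the values at **:15 and **:45 to be the mean of **:10, **:20 and of **:40, **:50 respectively. The core of the problem is that 15 minutes is not an integer multiple of 10 minutes. Otherwise simply applying df.resample('10Min', how='mean') would have worked.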
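
To make the target concrete, here is a minimal sketch that builds the desired values directly with NumPy. It assumes the series starts on a **:00 or **:30 mark and that its length is a multiple of 3, so every group of three 10-minute samples maps to two 15-minute samples. The names `blocks`, `desired` and `desired_series` are illustrative only and do not appear in the rest of the post.

```python
import numpy as np
import pandas as pd

periods = 12
index_10min = pd.date_range('2010-01-01', freq='10min', periods=periods)
y = -(np.arange(periods) - periods / 2) ** 2

# Each row holds the samples at (:00, :10, :20) or (:30, :40, :50).
blocks = y.reshape(-1, 3)

# Keep the first sample of each row, average the other two,
# then interleave the two columns back into one sequence.
desired = np.column_stack([blocks[:, 0], blocks[:, 1:].mean(axis=1)]).ravel()

index_15min = pd.date_range(index_10min[0], freq='15min', periods=desired.size)
desired_series = pd.Series(desired, index=index_15min, name='y')
print(desired_series)
```

This only describes the target values; it is not a general resampling method, because it depends on the alignment assumptions stated above.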

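Possible solutions

  1. Simply resample to 15 minutes and live with the small error this introduces.

  2. Use two forms of resampling, one with closed='left', label='left' and one with closed='right', label='right'. Afterwards I can average the two resampled series. The result still contains some error, but a smaller one than with the first method.

  3. Resample everything to 5-minute data and then apply a rolling mean. Something similar is done here: Pandas: rolling mean by time interval

  4. Resample and average with a varying number of inputs: Use numpy.average with weights for resampling a pandas array. For this I would have to create a new series of weights with varying length, where the weights alternate between 1 and 2.

  5. Resample everything to 5-minute data and then apply linear interpolation. This method is close to method 3. Pandas data frame: resample with linear interpolation. Edit: @Paul H gave a workable solution along these lines, which is reliable. Thanks!

None of these methods is really satisfying to me. Some lead to a small error, and the others are hard for an outsider to read.

Implementation

Implementation of methods 1, 2 and 5 together with the desired output, combined with a visualization.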

    #%% start plot
    plt.figure()
    plt.plot(df.index, df['y'], label='original')
    
    #%% resample the data to 15 minutes and plot the result
    close = 'left'; label='left'
    dfresamplell = pd.DataFrame()
    dfresamplell['15min'] = df.y.resample('15min', closed=close, label=label).mean()
    labelstring = 'close ' + close + ' label ' + label        
    plt.plot(dfresamplell.index, dfresamplell['15min'], label=labelstring)
    
    close = 'right'; label='right'
    dfresamplerr = pd.DataFrame()
    dfresamplerr['15min'] = df.y.resample('15min', closed=close, label=label).mean()
    labelstring = 'close ' + close + ' label ' + label        
    plt.plot(dfresamplerr.index, dfresamplerr['15min'], label=labelstring)
    
    #%% make an average
    dfresampleaverage = pd.DataFrame(index=dfresamplell.index)
    dfresampleaverage['15min'] = (dfresamplell['15min'].values+dfresamplerr['15min'].values[:-1])/2
    plt.plot(dfresampleaverage.index, dfresampleaverage['15min'], label='average of both resampling methods')
    
    #%% desired output
    ydesired = np.zeros(periods // 3 * 2)  # integer division: np.zeros needs an int size
    i = 0 
    j = 0 
    k = 0 
    for val in ydesired:
        if i+k==len(y): k=0
        ydesired[j] = np.mean([y[i],y[i+k]]) 
        j+=1
        i+=1
        if k==0: k=1; 
        else: k=0; i+=1
    plt.plot(dfresamplell.index, ydesired, label='ydesired')
    
    
    #%% suggestion of Paul H
    dfreindex = df.reindex(pd.date_range(startdate, freq='5min', periods=periods*2))
    dfreindex.interpolate(inplace=True)
    dfreindex = dfreindex.resample('15min').first().head()
    plt.plot(dfreindex.index, dfreindex['y'], label='method Paul H')
    
    
    #%% finalize plot
    plt.legend()
    

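Implementation for angles

As a bonus, I have added the code for interpolating angles. This is done by using complex numbers. Because complex interpolation is not implemented yet, I split the complex numbers into a real and an imaginary part. After averaging, these numbers can be converted back to angles. For certain angles this is a better resampling method than simply averaging the two angles, for example 345 and 5 degrees.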
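
As a small illustration of why the detour through complex numbers helps, the sketch below averages 345 and 5 degrees both ways. The helper name `circular_mean_deg` is made up for this example and is not part of NumPy or pandas.

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Mean direction of angles given in degrees, mapped to [0, 360)."""
    # Put each angle on the unit circle as a complex number.
    z = np.exp(1j * np.deg2rad(np.asarray(angles_deg, dtype=float)))
    # The argument of the mean vector is the mean direction.
    return np.angle(z.mean(), deg=True) % 360

naive = np.mean([345, 5])               # arithmetic mean, ignores the wrap at 360
circular = circular_mean_deg([345, 5])  # mean direction on the circle
print(naive, circular)
```

The arithmetic mean gives 175 degrees, which points almost opposite to both inputs, while the mean direction is close to 355 degrees.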

    #%% make timestamps
    periods = 24*6
    startdate = '2010-01-01'
    timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)
    
    #%% Make DataFrame and fill it with some data
    degrees = np.cumsum(np.random.randn(periods)*25) % 360
    df = pd.DataFrame(index=timestamp10min)
    df['deg'] = degrees
    df['zreal'] = np.cos(df['deg']*np.pi/180)
    df['zimag'] = np.sin(df['deg']*np.pi/180)
    
    #%% suggestion of Paul H
    dfreindex = df.reindex(pd.date_range(startdate, freq='5min', periods=periods*2))
    dfreindex = dfreindex.interpolate()
    dfresample = dfreindex.resample('15min').first()
    
    #%% convert complex to degrees
    def f(x):
        # x is one row; rebuild the complex number and take its argument in degrees
        return np.angle(x['zreal'] + x['zimag']*1j, deg=True)
    dfresample['degrees'] = dfresample[['zreal', 'zimag']].apply(f, axis=1)
    
    #%% set all the values between 0-360 degrees
    # shift only the 'degrees' column; the other columns must stay untouched
    negative = dfresample['degrees'] < 0
    dfresample.loc[negative, 'degrees'] += 360
    
    #%% wrong resampling
    dfresample['deg'] = dfresample['deg'] % 360
    
    #%% plot different sampling methods
    plt.figure()
    plt.plot(df.index, df['deg'], label='normal', marker='v')
    plt.plot(dfresample.index, dfresample['degrees'], label='resampled according @Paul H', marker='^')
    plt.plot(dfresample.index, dfresample['deg'], label='wrong resampling', marker='<')
    plt.legend()
    

Thank you for helping me!

2 answers:

Answer 0: (score: 3)

I might be misunderstanding the problem, but does this work?

TL;DR version:

import numpy as np
import pandas

data = np.arange(0, 101, 8)
index_10T = pandas.date_range(start='2012-01-01 00:00', freq='10min', periods=data.shape[0])
index_05T = pandas.date_range(start=index_10T[0], end=index_10T[-1], freq='5min')
index_15T = pandas.date_range(start=index_10T[0], end=index_10T[-1], freq='15min')
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1.reindex(index=index_05T).interpolate().loc[index_15T])

Long version

Set up fake data

import numpy as np
import pandas

data = np.arange(0, 101, 8)
index_10T = pandas.date_range(start='2012-01-01 00:00', freq='10min', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1)


                      A
2012-01-01 00:00:00   0
2012-01-01 00:10:00   8
2012-01-01 00:20:00  16
2012-01-01 00:30:00  24
2012-01-01 00:40:00  32
2012-01-01 00:50:00  40
2012-01-01 01:00:00  48
2012-01-01 01:10:00  56
2012-01-01 01:20:00  64
2012-01-01 01:30:00  72
2012-01-01 01:40:00  80
2012-01-01 01:50:00  88
2012-01-01 02:00:00  96

Then build a new 5-minute index and reindex the original dataframe

index_05T = pandas.date_range(start=index_10T[0], end=index_10T[-1], freq='5min')
df2 = df1.reindex(index=index_05T)
print(df2)

                      A
2012-01-01 00:00:00   0
2012-01-01 00:05:00 NaN
2012-01-01 00:10:00   8
2012-01-01 00:15:00 NaN
2012-01-01 00:20:00  16
2012-01-01 00:25:00 NaN
2012-01-01 00:30:00  24
2012-01-01 00:35:00 NaN
2012-01-01 00:40:00  32
2012-01-01 00:45:00 NaN
2012-01-01 00:50:00  40
2012-01-01 00:55:00 NaN
2012-01-01 01:00:00  48
2012-01-01 01:05:00 NaN
2012-01-01 01:10:00  56
2012-01-01 01:15:00 NaN
2012-01-01 01:20:00  64
2012-01-01 01:25:00 NaN
2012-01-01 01:30:00  72
2012-01-01 01:35:00 NaN
2012-01-01 01:40:00  80
2012-01-01 01:45:00 NaN
2012-01-01 01:50:00  88
2012-01-01 01:55:00 NaN
2012-01-01 02:00:00  96

Then linearly interpolate

print(df2.interpolate())
                      A
2012-01-01 00:00:00   0
2012-01-01 00:05:00   4
2012-01-01 00:10:00   8
2012-01-01 00:15:00  12
2012-01-01 00:20:00  16
2012-01-01 00:25:00  20
2012-01-01 00:30:00  24
2012-01-01 00:35:00  28
2012-01-01 00:40:00  32
2012-01-01 00:45:00  36
2012-01-01 00:50:00  40
2012-01-01 00:55:00  44
2012-01-01 01:00:00  48
2012-01-01 01:05:00  52
2012-01-01 01:10:00  56
2012-01-01 01:15:00  60
2012-01-01 01:20:00  64
2012-01-01 01:25:00  68
2012-01-01 01:30:00  72
2012-01-01 01:35:00  76
2012-01-01 01:40:00  80
2012-01-01 01:45:00  84
2012-01-01 01:50:00  88
2012-01-01 01:55:00  92
2012-01-01 02:00:00  96

Build a 15-minute index and use it to pull out the data:

index_15T = pandas.date_range(start=index_10T[0], end=index_10T[-1], freq='15min')
print(df2.interpolate().loc[index_15T])

                      A
2012-01-01 00:00:00   0
2012-01-01 00:15:00  12
2012-01-01 00:30:00  24
2012-01-01 00:45:00  36
2012-01-01 01:00:00  48
2012-01-01 01:15:00  60
2012-01-01 01:30:00  72
2012-01-01 01:45:00  84
2012-01-01 02:00:00  96
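
The steps above can also be wrapped in a small helper. The sketch below is a variation and not the exact code from this answer: instead of a common 5-minute grid it reindexes onto the union of the original and target timestamps and uses `interpolate(method='time')`, so it does not rely on the two frequencies sharing a convenient common divisor. The function name `resample_by_interpolation` is hypothetical.

```python
import numpy as np
import pandas as pd

def resample_by_interpolation(series, freq):
    """Interpolate `series` linearly in time and sample it on a regular `freq` grid."""
    target = pd.date_range(series.index[0], series.index[-1], freq=freq)
    # Union of existing and wanted timestamps; new timestamps start out as NaN.
    combined = series.index.union(target)
    filled = series.reindex(combined).interpolate(method='time')
    return filled.loc[target]

index_10min = pd.date_range('2012-01-01 00:00', freq='10min', periods=13)
s = pd.Series(np.arange(0, 101, 8), index=index_10min, name='A')
result = resample_by_interpolation(s, '15min')
print(result)
```

For the fake data used in this answer it should reproduce the values in the table above, returned as floats.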

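Answer 1: (score: 0)

OK, here is one way to do it.

  1. Make a list of the times you want filled in
  2. Build a combined index that includes the times you want and the times you already have
  3. Take your data and forward-fill it
  4. Take your data and backward-fill it
  5. Average the forward fill and the backward fill
  6. Select only the rows you want

Note that this only works because the values you want lie exactly halfway, in time, between values you already have. Note also that the last time comes out as np.nan when you have no later data.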

    import datetime as dt

    import pandas as pd

    # build the list of 15-minute timestamps that need values
    times_15 = []
    current = df.index[0]
    while current < df.index[-2]:
        current = current + dt.timedelta(minutes=15)
        times_15.append(current)
    # union of the wanted times and the existing times (already sorted)
    combined = df.index.union(pd.DatetimeIndex(times_15))
    df = df.reindex(combined)
    df['ff'] = df['y'].ffill()
    df['bf'] = df['y'].bfill()
    df['solution'] = df[['ff', 'bf']].mean(axis=1)
    df.loc[times_15, :]