Extrapolate Pandas DataFrame

Time: 2015-12-08 15:11:25

Tags: python python-2.7 pandas extrapolation

Interpolating in a Pandas DataFrame with Series.interpolate is easy, but how can an extrapolation be done?

For example, given the DataFrame built below, how can we extrapolate it 14 more months, to 2014-12-31? Linear extrapolation is just fine.

    import pandas as pd

    X1 = list(range(10))
    X2 = [x**2 for x in X1]
    # 'M' is the month-end frequency alias (newer pandas releases spell it 'ME')
    df = pd.DataFrame({'x1': X1, 'x2': X2},  index=pd.date_range('20130101',periods=10,freq='M'))

I think a new DataFrame must be created first, with a DatetimeIndex that starts at 2013-11-30 and extends for 14 more M periods. Beyond that, I am stuck.


1 answer:

Answer 0: (score: 12)

Extrapolating DataFrames with a DatetimeIndex index

This can be done in two steps:

  1. Extend the DatetimeIndex
  2. Extrapolate the data

Extend the index

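Overwrite df with a new DataFrame whose data is reindexed onto a new, extended index built from the original index's start, number of periods and frequency. This lets the original df come from anywhere, such as a CSV file. The appended rows are conveniently filled with NaNs.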

    import pandas as pd

    # Fake DataFrame for example (could come from anywhere)
    X1 = list(range(10))
    X2 = [x**2 for x in X1]
    # 'M' is the month-end frequency alias (newer pandas releases spell it 'ME')
    df = pd.DataFrame({'x1': X1, 'x2': X2},  index=pd.date_range('20130101',periods=10,freq='M'))
    
    # Number of months to extend
    extend = 5
    
    # Extrapolate the index first based on original index
    df = pd.DataFrame(
        data=df,
        index=pd.date_range(
            start=df.index[0],
            periods=len(df.index) + extend,
            freq=df.index.freq
        )
    )
    
    # Display
    print(df)
    
                    x1  x2
    2013-01-31   0   0
    2013-02-28   1   1
    2013-03-31   2   4
    2013-04-30   3   9
    2013-05-31   4  16
    2013-06-30   5  25
    2013-07-31   6  36
    2013-08-31   7  49
    2013-09-30   8  64
    2013-10-31   9  81
    2013-11-30 NaN NaN
    2013-12-31 NaN NaN
    2014-01-31 NaN NaN
    2014-02-28 NaN NaN
    2014-03-31 NaN NaN
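
The same extension can also be written with an explicit reindex, which some readers may find easier to follow than passing a DataFrame back into the constructor. This is a minimal sketch, not the answer's own code: the names `new_index` and `extended` are made up for illustration, and `pd.offsets.MonthEnd()` is used in place of the `'M'` alias only so the snippet does not depend on which spelling of the alias a given pandas version accepts.

```python
import pandas as pd

# Same example data as above
X1 = list(range(10))
X2 = [x ** 2 for x in X1]
df = pd.DataFrame(
    {'x1': X1, 'x2': X2},
    index=pd.date_range('20130101', periods=10, freq=pd.offsets.MonthEnd()),
)

extend = 5  # number of months to append

# Build the longer index from the original start and frequency
new_index = pd.date_range(
    start=df.index[0],
    periods=len(df.index) + extend,
    freq=df.index.freq,
)

# reindex keeps the existing rows and fills the new ones with NaN
extended = df.reindex(new_index)
print(extended)
```

The original `df` is left untouched here, which makes it easy to compare the two frames while experimenting.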
    

Extrapolate the data

Most extrapolators require numeric inputs rather than dates. This can be done with:

    # Temporarily remove dates and make index numeric
    di = df.index
    df = df.reset_index(drop=True)
    

See this answer for how to extrapolate the values of each column of a DataFrame with a 3rd order polynomial.

Excerpt from that answer:

    # Excerpt only: curve_fit (from scipy.optimize), func, guess, fit_df
    # (assumed to be df without its NaN rows) and col_params (a dict)
    # are expected to exist already
    # Curve fit each column
    for col in fit_df.columns:
        # Get x & y
        x = fit_df.index.astype(float).values
        y = fit_df[col].values
        # Curve fit column and get curve parameters
        params = curve_fit(func, x, y, guess)
        # Store optimized parameters
        col_params[col] = params[0]
    
    # Extrapolate each column
    for col in df.columns:
        # Boolean mask of the rows that are NaN in this column
        missing = df[col].isnull().values
        # Numeric index values of those rows
        x = df.index[missing].astype(float).values
        # Extrapolate those points with the fitted function
        df.loc[missing, col] = func(x, *col_params[col])
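
The excerpt relies on names it does not define. The sketch below fills them in so the fit can be run on its own. Treat the details as assumptions: `func` is written here as a plain 3rd order polynomial, `guess` is an arbitrary starting point, `fit_df` is taken to be the rows without NaNs, and the input frame is rebuilt by hand with a numeric index instead of reusing the date handling above.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit


def func(x, a, b, c, d):
    """Assumed model: a 3rd order polynomial in x."""
    return a * x ** 3 + b * x ** 2 + c * x + d


# Ten known rows followed by five NaN rows, on a plain 0..14 integer index
df = pd.DataFrame({
    'x1': [float(i) for i in range(10)] + [np.nan] * 5,
    'x2': [float(i ** 2) for i in range(10)] + [np.nan] * 5,
})

guess = (0.5, 0.5, 0.5, 0.5)  # assumed starting point for the optimizer
fit_df = df.dropna()          # only the known rows take part in the fit
col_params = {}               # fitted parameters, keyed by column name

# Fit each column against the numeric index
for col in fit_df.columns:
    x = fit_df.index.astype(float).values
    y = fit_df[col].values
    popt, _ = curve_fit(func, x, y, p0=guess)
    col_params[col] = popt

# Evaluate the fitted curve wherever a column is NaN
for col in df.columns:
    missing = df[col].isnull().values
    x = df.index[missing].astype(float).values
    df.loc[missing, col] = func(x, *col_params[col])

print(df.round(3))
```

Because the example columns are themselves polynomials of degree one and two, a cubic can reproduce them almost exactly. Real data will usually not extrapolate this cleanly, and a high order polynomial can diverge quickly outside the fitted range.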
    

Once the columns are extrapolated, put the dates back:

    # Put date index back
    df.index = di
    
    # Display
    print(df)
    
                    x1   x2
    2013-01-31   0    0
    2013-02-28   1    1
    2013-03-31   2    4
    2013-04-30   3    9
    2013-05-31   4   16
    2013-06-30   5   25
    2013-07-31   6   36
    2013-08-31   7   49
    2013-09-30   8   64
    2013-10-31   9   81
    2013-11-30  10  100
    2013-12-31  11  121
    2014-01-31  12  144
    2014-02-28  13  169
    2014-03-31  14  196
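
Since the question says a linear extrapolation is enough and asks for 14 more months, the two steps can also be combined with a straight-line fit. The following is a sketch under stated assumptions, not part of the original answer: `numpy.polyfit` with degree 1 is swapped in for the curve fit, row positions stand in for the dates (so months are treated as equally spaced), and the names `full_index`, `out` and `pos` are illustrative.

```python
import numpy as np
import pandas as pd

X1 = list(range(10))
X2 = [x ** 2 for x in X1]
month_end = pd.offsets.MonthEnd()
df = pd.DataFrame(
    {'x1': X1, 'x2': X2},
    index=pd.date_range('20130101', periods=10, freq=month_end),
)

extend = 14  # 2013-10-31 plus 14 month ends reaches 2014-12-31

# Extend the index; the appended rows start out as NaN
full_index = pd.date_range(df.index[0], periods=len(df) + extend, freq=month_end)
out = df.reindex(full_index)

# Row positions 0..23 stand in for the dates during the fit
pos = np.arange(len(out), dtype=float)

for col in out.columns:
    known = out[col].notnull().values
    # Degree-1 least-squares fit through the known points
    slope, intercept = np.polyfit(pos[known], out.loc[known, col].values, 1)
    # Fill only the missing rows with the fitted line
    out.loc[~known, col] = slope * pos[~known] + intercept

print(out.tail())
```

With this approach x1 simply continues its existing trend, while x2 follows the best-fit straight line through the known points rather than the parabola, which is the expected trade-off of a linear extrapolation.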