Extrapolating Pandas `DataFrame`s

Question

It's very easy to interpolate NaN cells in a Pandas DataFrame:

In [98]: df
Out[98]:
            neg       neu       pos       avg
250    0.508475  0.527027  0.641292  0.558931
500         NaN       NaN       NaN       NaN
1000   0.650000  0.571429  0.653983  0.625137
2000        NaN       NaN       NaN       NaN
3000   0.619718  0.663158  0.665468  0.649448
4000        NaN       NaN       NaN       NaN
6000        NaN       NaN       NaN       NaN
8000        NaN       NaN       NaN       NaN
10000       NaN       NaN       NaN       NaN
20000       NaN       NaN       NaN       NaN
30000       NaN       NaN       NaN       NaN
50000       NaN       NaN       NaN       NaN

[12 rows x 4 columns]

In [99]: df.interpolate(method='nearest', axis=0)
Out[99]:
            neg       neu       pos       avg
250    0.508475  0.527027  0.641292  0.558931
500    0.508475  0.527027  0.641292  0.558931
1000   0.650000  0.571429  0.653983  0.625137
2000   0.650000  0.571429  0.653983  0.625137
3000   0.619718  0.663158  0.665468  0.649448
4000        NaN       NaN       NaN       NaN
6000        NaN       NaN       NaN       NaN
8000        NaN       NaN       NaN       NaN
10000       NaN       NaN       NaN       NaN
20000       NaN       NaN       NaN       NaN
30000       NaN       NaN       NaN       NaN
50000       NaN       NaN       NaN       NaN

[12 rows x 4 columns]

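I would also like it to extrapolate the NaN values that lie outside the interpolation range, using the given method. How can I best do this?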

Answer 1

Extrapolating Pandas `DataFrame`s

A DataFrame can be extrapolated; however, there is no simple method call for it within pandas, and another library is required (e.g. scipy.optimize).

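Extrapolating

Generally speaking, extrapolating requires one to make certain assumptions about the data being extrapolated. One way is to curve fit some general parameterized equation to the data, find the parameter values that best describe the existing data, and then use them to calculate values beyond the range of that data. The difficult and limiting part of this approach is that some assumption about the trend must be made when the parameterized equation is chosen. A suitable equation can be found by trial and error with different equations until one gives the desired result, or it can sometimes be inferred from the source of the data. The data provided in the question is really not a large enough dataset to obtain a well-fit curve; however, it is good enough for illustration.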

The following is an example of extrapolating the DataFrame with a 3rd-order polynomial:

$f(x) = a x^3 + b x^2 + c x + d$ (Eq. 1)

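This generic function (func()) is curve fit to each column to obtain unique, column-specific parameters (i.e. a, b, c, and d). These parameterized equations are then used to extrapolate the data in each column for all index values that hold NaNs.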

import pandas as pd
from io import StringIO
from scipy.optimize import curve_fit

df = pd.read_table(StringIO('''
                neg       neu       pos       avg
    0           NaN       NaN       NaN       NaN
    250    0.508475  0.527027  0.641292  0.558931
    500         NaN       NaN       NaN       NaN
    1000   0.650000  0.571429  0.653983  0.625137
    2000        NaN       NaN       NaN       NaN
    3000   0.619718  0.663158  0.665468  0.649448
    4000        NaN       NaN       NaN       NaN
    6000        NaN       NaN       NaN       NaN
    8000        NaN       NaN       NaN       NaN
    10000       NaN       NaN       NaN       NaN
    20000       NaN       NaN       NaN       NaN
    30000       NaN       NaN       NaN       NaN
    50000       NaN       NaN       NaN       NaN'''), sep=r'\s+')

# Do the original interpolation
df.interpolate(method='nearest', axis=0, inplace=True)

# Display result
print('Interpolated data:')
print(df)
print()

# Function to curve fit to the data
def func(x, a, b, c, d):
    return a * (x ** 3) + b * (x ** 2) + c * x + d

# Initial parameter guess, just to kick off the optimization
guess = (0.5, 0.5, 0.5, 0.5)

# Create copy of data to remove NaNs for curve fitting
fit_df = df.dropna()

# Place to store function parameters for each column
col_params = {}

# Curve fit each column
for col in fit_df.columns:
    # Get x & y
    x = fit_df.index.astype(float).values
    y = fit_df[col].values
    # Curve fit column and get curve parameters
    params = curve_fit(func, x, y, guess)
    # Store optimized parameters
    col_params[col] = params[0]

# Extrapolate each column
for col in df.columns:
    # Boolean mask of the rows that are still NaN in this column
    missing = pd.isnull(df[col])
    # Get the index values for NaNs in the column
    x = df.index[missing].astype(float).values
    # Extrapolate those points with the fitted function
    # (.loc avoids the chained assignment df[col][x] = ...)
    df.loc[missing, col] = func(x, *col_params[col])

# Display result
print('Extrapolated data:')
print(df)
print()

print('Data was extrapolated with these column functions:')
for col in col_params:
    print('f_{}(x) = {:0.3e} x^3 + {:0.3e} x^2 + {:0.4f} x + {:0.4f}'.format(col, *col_params[col]))

Extrapolated results

Interpolated data:
            neg       neu       pos       avg
0           NaN       NaN       NaN       NaN
250    0.508475  0.527027  0.641292  0.558931
500    0.508475  0.527027  0.641292  0.558931
1000   0.650000  0.571429  0.653983  0.625137
2000   0.650000  0.571429  0.653983  0.625137
3000   0.619718  0.663158  0.665468  0.649448
4000        NaN       NaN       NaN       NaN
6000        NaN       NaN       NaN       NaN
8000        NaN       NaN       NaN       NaN
10000       NaN       NaN       NaN       NaN
20000       NaN       NaN       NaN       NaN
30000       NaN       NaN       NaN       NaN
50000       NaN       NaN       NaN       NaN

Extrapolated data:
               neg          neu         pos          avg
0         0.411206     0.486983    0.631233     0.509807
250       0.508475     0.527027    0.641292     0.558931
500       0.508475     0.527027    0.641292     0.558931
1000      0.650000     0.571429    0.653983     0.625137
2000      0.650000     0.571429    0.653983     0.625137
3000      0.619718     0.663158    0.665468     0.649448
4000      0.621036     0.969232    0.708464     0.766245
6000      1.197762     2.799529    0.991552     1.662954
8000      3.281869     7.191776    1.702860     4.058855
10000     7.767992    15.272849    3.041316     8.694096
20000    97.540944   150.451269   26.103320    91.365599
30000   381.559069   546.881749   94.683310   341.042883
50000  1979.646859  2686.936912  467.861511  1711.489069

Data was extrapolated with these column functions:
f_neg(x) = 1.864e-11 x^3 + -1.471e-07 x^2 + 0.0003 x + 0.4112
f_neu(x) = 2.348e-11 x^3 + -1.023e-07 x^2 + 0.0002 x + 0.4870
f_avg(x) = 1.542e-11 x^3 + -9.016e-08 x^2 + 0.0002 x + 0.5098
f_pos(x) = 4.144e-12 x^3 + -2.107e-08 x^2 + 0.0000 x + 0.6312

[Figure: plot for the `avg` column]

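Without a larger dataset or knowledge of the source of the data, this result could be completely wrong, but it should exemplify the process of extrapolating a DataFrame. The assumed equation in func() would probably need to be played with to get the correct extrapolation. Also, no attempt was made to make the code efficient.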
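To illustrate how much the assumed equation matters, here is a minimal sketch that fits two different polynomial orders to the three known `avg` values from the question and evaluates both far outside the data. It swaps in `numpy.polyfit` for `curve_fit` purely for brevity; the helper name `extrapolate_poly` and the variables `known` and `targets` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Known (non-NaN) points of the question's `avg` column.
known = pd.Series([0.558931, 0.625137, 0.649448],
                  index=[250, 1000, 3000], name="avg")

# Index values at which extrapolated values are wanted.
targets = np.array([4000, 10000, 50000], dtype=float)


def extrapolate_poly(series, new_x, degree):
    """Least-squares polynomial fit of the given degree, evaluated at new_x."""
    x = series.index.to_numpy(dtype=float)
    y = series.to_numpy(dtype=float)
    coeffs = np.polyfit(x, y, degree)  # coefficients, highest power first
    return np.polyval(coeffs, new_x)


linear = extrapolate_poly(known, targets, degree=1)
quadratic = extrapolate_poly(known, targets, degree=2)

# Side-by-side view of the two assumed models.
comparison = pd.DataFrame({"linear": linear, "quadratic": quadratic},
                          index=targets.astype(int))
print(comparison)
```

Both models describe the known points reasonably well, yet they diverge sharply at large index values, which is why the choice of equation has to come from knowledge of the data rather than from the fit alone.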

Update

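If your index is non-numeric, such as a DatetimeIndex, it first has to be converted to numeric values before it can be extrapolated this way.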
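For a non-numeric index such as a DatetimeIndex, one possible approach (a sketch, not necessarily what the answer originally linked to) is to convert the timestamps to elapsed time as floats, fit on those numbers, and write the fitted values back. The daily series below and the straight-line model are assumptions chosen only to keep the example small.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series; the last two values are missing.
idx = pd.date_range("2024-01-01", periods=5, freq="D")
s = pd.Series([1.0, 2.0, 3.0, np.nan, np.nan], index=idx)

# Convert timestamps to elapsed days since the first timestamp.
elapsed_days = ((s.index - s.index[0]) / pd.Timedelta(days=1)).to_numpy(dtype=float)

# Boolean mask of positions that hold real data.
known = s.notna().to_numpy()

# Fit a straight line to the known points (an assumed model).
slope, intercept = np.polyfit(elapsed_days[known], s.to_numpy()[known], 1)

# Fill only the missing positions with the fitted values.
filled = s.copy()
filled.loc[~known] = slope * elapsed_days[~known] + intercept
print(filled)
```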

Answer 2

import pandas as pd
try:
    # for Python2
    from cStringIO import StringIO 
except ImportError:
    # for Python3
    from io import StringIO

df = pd.read_table(StringIO('''
                neg       neu       pos       avg
    0           NaN       NaN       NaN       NaN
    250    0.508475  0.527027  0.641292  0.558931
    999         NaN       NaN       NaN       NaN
    1000   0.650000  0.571429  0.653983  0.625137
    2000        NaN       NaN       NaN       NaN
    3000   0.619718  0.663158  0.665468  0.649448
    4000        NaN       NaN       NaN       NaN
    6000        NaN       NaN       NaN       NaN
    8000        NaN       NaN       NaN       NaN
    10000       NaN       NaN       NaN       NaN
    20000       NaN       NaN       NaN       NaN
    30000       NaN       NaN       NaN       NaN
    50000       NaN       NaN       NaN       NaN'''), sep=r'\s+')

print(df.interpolate(method='nearest', axis=0).ffill().bfill())

yields

            neg       neu       pos       avg
0      0.508475  0.527027  0.641292  0.558931
250    0.508475  0.527027  0.641292  0.558931
999    0.650000  0.571429  0.653983  0.625137
1000   0.650000  0.571429  0.653983  0.625137
2000   0.650000  0.571429  0.653983  0.625137
3000   0.619718  0.663158  0.665468  0.649448
4000   0.619718  0.663158  0.665468  0.649448
6000   0.619718  0.663158  0.665468  0.649448
8000   0.619718  0.663158  0.665468  0.649448
10000  0.619718  0.663158  0.665468  0.649448
20000  0.619718  0.663158  0.665468  0.649448
30000  0.619718  0.663158  0.665468  0.649448
50000  0.619718  0.663158  0.665468  0.649448

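Note: I changed your df slightly to show how interpolating with nearest differs from doing a df.fillna. (See the row with index 999.)

I also added a row of NaNs with index 0 to show that bfill() may also be necessary.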
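The difference at index 999 can be checked directly on a single column. The sketch below uses only the `avg` values from the answer; the variable names are illustrative, and `method='nearest'` requires SciPy to be installed.

```python
import numpy as np
import pandas as pd

# The `avg` column from the answer, reduced to a few rows.
s = pd.Series([np.nan, 0.558931, np.nan, 0.625137, np.nan, 0.649448, np.nan],
              index=[0, 250, 999, 1000, 2000, 3000, 4000], name="avg")

# Nearest-neighbour interpolation inside the data range, then edge filling.
nearest_filled = s.interpolate(method="nearest").ffill().bfill()

# Plain forward/backward fill, ignoring distances along the index.
ffill_only = s.ffill().bfill()

print(pd.DataFrame({"nearest": nearest_filled, "ffill": ffill_only}))
```

At index 999 the nearest-neighbour version takes the value from index 1000, whereas the plain forward fill carries the value from index 250 forward.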

Answer 3

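I had the same issue, but I couldn't find anything specific to Pandas that was direct and helpful (without defining a new function). However, I found InterpolatedUnivariateSpline (from scipy) very useful for extrapolation. It gives you the flexibility to change the order of the spline, as opposed to just giving you a constant.

Here is the relevant example: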

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import InterpolatedUnivariateSpline
x = np.linspace(-3, 3, 50)
y = np.exp(-x**2) + 0.1 * np.random.randn(50)
spl = InterpolatedUnivariateSpline(x, y)
plt.plot(x, y, 'ro', ms=5)
xs = np.linspace(-3, 3, 1000)
plt.plot(xs, spl(xs), 'g', lw=3, alpha=0.7)
plt.show()
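The example above only evaluates the spline inside the fitted range. As a rough sketch of how the same class could fill a DataFrame column beyond its last known value, the snippet below fits a first-order spline to the known `avg` values from the question. The choice of `k=1` and the variable names are assumptions for illustration; `ext=0` is the SciPy default and means the spline is extrapolated outside the data.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import InterpolatedUnivariateSpline

# A shortened version of the question's `avg` column.
avg = pd.Series([0.558931, np.nan, 0.625137, np.nan, 0.649448, np.nan, np.nan],
                index=[250, 500, 1000, 2000, 3000, 4000, 6000], name="avg")

known = avg.dropna()
x_known = known.index.to_numpy(dtype=float)

# k=1 is a piecewise-linear spline through the known points;
# ext=0 continues the last segment outside the data range.
spline = InterpolatedUnivariateSpline(x_known, known.to_numpy(), k=1, ext=0)

# Fill every NaN, both inside and beyond the known range.
filled = avg.copy()
missing = filled.isna().to_numpy()
filled.loc[missing] = spline(filled.index[missing].to_numpy(dtype=float))
print(filled)
```

A higher `k` bends the extrapolated curve more strongly, so with only three known points a low order is the more cautious choice.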
