Question

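I have historical records of my database's growth (in terms of size) over the past few years. I'm trying to figure out the best method or chart that can show me the database's future growth based on that history. Of course, this won't help if we add a new table that also grows, but I'm just looking for a way to estimate it. I'm open to ideas in Python or R.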

Here are the database sizes in TB:

3.895 - 2012
6.863 - 2013
8.997 - 2014
10.626 - 2015

Answer 1

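By gluing together a few pieces of numpy and scipy, you can get a reasonable estimate from the first and second derivatives of a continuous approximation of the data.

There may well be better ways to do this, but this worked for me.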

import numpy as np
import scipy.interpolate
import matplotlib.pyplot as plt
import matplotlib.ticker  # the submodule must be imported explicitly to use ScalarFormatter

x = np.array([2012, 2013, 2014, 2015])
y = np.array([3.895, 6.863, 8.997, 10.626])

# interpolate to approximate a continuous version of hard drive usage over time
f = scipy.interpolate.interp1d(x, y, kind='quadratic')

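# approximate the first and second derivatives near the last point (2015)
# with central finite differences (scipy.misc.derivative is gone from recent SciPy releases)
dx = 0.01
x0 = x[-1] - 2*dx  # stay inside the interpolation range
first = (f(x0 + dx) - f(x0 - dx)) / (2*dx)
second = (f(x0 + dx) - 2*f(x0) + f(x0 - dx)) / dx**2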

# taylor series approximation near x[-1]
forecast = lambda x_new: np.poly1d([second/2, first, f(x[-1])])(x_new - x[-1])
forecast(2016)  # 11.9

xs = np.arange(2012, 2020)
ys = forecast(xs)

# needed to prevent matplotlib from putting the x-axis in scientific notation
x_formatter = matplotlib.ticker.ScalarFormatter(useOffset=False)  
plt.gca().xaxis.set_major_formatter(x_formatter)

plt.plot(xs, ys)
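
Written out, the forecast lambda above evaluates a second-order Taylor polynomial anchored at the last observation. The symbols x_n and x_0 are just labels for x[-1] and the slightly shifted point where the derivatives are estimated; the shift is there because the interpolant is only defined up to 2015.

```latex
\hat{f}(x) \;=\; f(x_n) \;+\; f'(x_0)\,(x - x_n) \;+\; \tfrac{1}{2}\,f''(x_0)\,(x - x_n)^2,
\qquad x_n = 2015, \quad x_0 = x_n - 2\,dx
```

If the estimated second derivative is negative, this curve eventually bends downward, so it is probably only sensible for a short horizon.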


Answer 2

d <- data.frame(x= 2012:2015,
            y = c(3.895, 6.863, 8.997, 10.626))

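You can look at the fits (and their projections) like this; here I'm comparing an additive model and a polynomial model. I'm not sure I trust the confidence intervals of the additive model, though: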

library("ggplot2"); theme_set(theme_bw())
ggplot(d,aes(x,y))+ geom_point() +
    expand_limits(x=2018)+
    geom_smooth(method="lm",formula=y~poly(x,2),
                fullrange=TRUE,fill="blue")+
    geom_smooth(method="gam",formula=y~s(x,k=3),colour="red",
                fullrange=TRUE,fill="red")


I'm a little shocked that the quadratic relationship fits this closely.

summary(m1 <- lm(y~poly(x,2),data=d))
## Residual standard error: 0.07357 on 1 degrees of freedom
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9994 
## F-statistic:  2344 on 2 and 1 DF,  p-value: 0.0146

Predictions:

predict(m1,newdata=data.frame(x=2016:2018),interval="confidence")
##        fit      lwr      upr
## 1 11.50325 8.901008 14.10549
## 2 11.72745 6.361774 17.09313
## 3 11.28215 2.192911 20.37139

Did you make these numbers up, or are they real data?

For more sophisticated approaches, the forecast package would probably be better.
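
For readers who would rather stay in Python, the quadratic fit from lm(y ~ poly(x, 2)) above can be approximated with NumPy. This is only a sketch of the point predictions and the residual standard error; it does not reproduce the confidence intervals. Centring the years before fitting is my own choice for numerical conditioning, and the variable names are illustrative.

```python
import numpy as np

years = np.array([2012, 2013, 2014, 2015], dtype=float)
size_tb = np.array([3.895, 6.863, 8.997, 10.626])

# Centre the years so the least-squares problem is well conditioned
# (an assumption of this sketch, not something the R answer does explicitly).
centre = years.mean()
coeffs = np.polyfit(years - centre, size_tb, deg=2)

# Residual standard error: 3 fitted coefficients leave n - 3 degrees of freedom.
residuals = size_tb - np.polyval(coeffs, years - centre)
dof = len(years) - 3
rse = np.sqrt(np.sum(residuals ** 2) / dof)

future = np.array([2016, 2017, 2018], dtype=float)
forecast = np.polyval(coeffs, future - centre)

print(f"residual standard error: {rse:.5f}")
for yr, value in zip(future, forecast):
    print(f"{yr:.0f}: {value:.5f} TB")
```

The fitted parabola peaks around 2017 and then declines, which matches the lower 2018 value in the R output. That is a property of the quadratic form rather than evidence that the database will shrink.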

Answer 3

On second thought, what you really want to use is a Gaussian process.

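import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(1)

X = np.atleast_2d([2012, 2013, 2014, 2015]).T
y = np.array([3.895, 6.863, 8.997, 10.626])

x_new = np.atleast_2d(np.linspace(2012, 2018, 1000)).T

# The legacy GaussianProcess class no longer exists in scikit-learn;
# GaussianProcessRegressor is swapped in here, so results will differ from the original plot.
# normalize_y=True lets predictions revert to the data mean rather than to zero.
gp = GaussianProcessRegressor(normalize_y=True)
gp.fit(X, y)
y_pred, sigma = gp.predict(x_new, return_std=True)

# use a flat index so pandas accepts it
df = pd.DataFrame(dict(prediction=y_pred, se=sigma), index=x_new.ravel())
df.plot(yerr='se')
plt.show()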
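
To make it more concrete what a Gaussian process forecast computes, here is a minimal NumPy-only sketch of GP regression with a squared-exponential kernel. It is not what scikit-learn does internally in detail: the length scale, the noise term, and the helper names rbf_kernel and gp_posterior are all assumptions chosen for illustration, and the kernel parameters are fixed rather than fitted.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=2.0):
    """Squared-exponential kernel between two 1-D arrays (length_scale is an assumed value)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_test, noise=1e-8, length_scale=2.0):
    """Posterior mean and standard deviation at x_test, with y standardised first."""
    mean, scale = y_train.mean(), y_train.std()
    y_std = (y_train - mean) / scale

    K = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_test, x_train, length_scale)

    # Posterior mean: K_s K^{-1} y, mapped back to the original units.
    mu = mean + scale * (K_s @ np.linalg.solve(K, y_std))

    # Posterior variance: prior variance (1.0) minus what the data explains.
    explained = np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
    var = np.clip(1.0 - explained, 0.0, None)
    return mu, scale * np.sqrt(var)


years = np.array([2012.0, 2013.0, 2014.0, 2015.0])
size_tb = np.array([3.895, 6.863, 8.997, 10.626])
future = np.array([2015.0, 2016.0, 2017.0, 2018.0])

mu, sd = gp_posterior(years, size_tb, future)
for yr, m, s in zip(future, mu, sd):
    print(f"{yr:.0f}: {m:.2f} TB +/- {s:.2f}")
```

The standard deviation is close to zero at an observed year and widens the further the forecast moves from the data, which is the behaviour the error bars in the plot are meant to show. With only four points, the forecast depends heavily on the assumed kernel settings.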

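Although the fundamentals are strong, Python needs better visualization libraries. Even getting the x-axis to show integers (rather than scientific notation) is unnecessarily hard.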
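
For what it's worth, one way to get whole-year tick labels in matplotlib is sketched below. It assumes the axis still uses the default ScalarFormatter, and the axis labels are my own. The Agg backend is selected only so the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

years = [2012, 2013, 2014, 2015]
size_tb = [3.895, 6.863, 8.997, 10.626]

fig, ax = plt.subplots()
ax.plot(years, size_tb, marker="o")

# Place ticks on whole numbers only and turn off offset/scientific notation on the x-axis.
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.ticklabel_format(axis="x", style="plain", useOffset=False)

ax.set_xlabel("Year")
ax.set_ylabel("Database size (TB)")
fig.canvas.draw()  # render; use plt.show() or fig.savefig(...) in practice
```

The same two calls can be applied to the axes object returned by df.plot(...), since pandas returns a regular matplotlib Axes.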


Estimating future growth from a sample of historical data
