Question

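I recently started using Python for machine learning. Below is a dataset I picked as an example, along with the code I have been working on so far. I use [2000 ... 2015] as training data and [2016, 2017] as test data.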

Dataset  
      Years        Values
    0    2000      23.0
    1    2001      27.5
    2    2002      46.0
    3    2003      56.0
    4    2004      64.8
    5    2005      71.2
    6    2006      80.2
    7    2007      98.0
    8    2008     113.0
    9    2009     155.8
    10   2010     414.0
    11   2011    2297.8
    12   2012    3628.4
    13   2013   16187.8
    14   2014   25197.8
    15   2015   42987.8
    16   2016   77555.5
    17   2017  130631.9

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

df = pd.DataFrame([[i for i in range(2000,2018)], 
[23.0,27.5,46.0,56.0,64.8,71.2,80.2,98.0,113.0,155.8,414.0,2297.8,3628.4,16187.8,25197.8,42987.8,77555.5,130631.9]])


df = df.T
df.columns = ['Years', 'Values']

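The code above creates the DataFrame. One more thing to keep in mind: my Years column is a time series, not just a continuous value. I have not made any changes to account for this.

I want to fit a non-linear model and plot it, the same way I did in my linear model example. Below is what I tried with a linear model. Note that my own example does not seem to account for the fact that the Years column is a time series rather than a continuous variable.

Once we have a model, we would like to use it to predict values for at least the next two years.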

X = df.iloc[:, :-1].values
y = df.iloc[:, 1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0, shuffle = False)
lm = LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, lm.predict(X_train), color = 'blue')
plt.title('Years vs Values (training set)')
plt.xlabel('Years')
plt.ylabel('Values')
plt.show()
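
As a quick check of what the split above produces, here is a minimal sketch (an editorial illustration, not part of the original post; the names `years` and `values` are my own). With 18 rows, `test_size=0.1` and `shuffle=False`, the last two years should end up in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same data as in the question, rebuilt as plain arrays
years = np.arange(2000, 2018).reshape(-1, 1)  # 18 rows, one feature column
values = np.array([23.0, 27.5, 46.0, 56.0, 64.8, 71.2, 80.2, 98.0, 113.0,
                   155.8, 414.0, 2297.8, 3628.4, 16187.8, 25197.8, 42987.8,
                   77555.5, 130631.9])

# shuffle=False keeps chronological order, so the test set is the tail
X_train, X_test, y_train, y_test = train_test_split(
    years, values, test_size=0.1, random_state=0, shuffle=False)

print("train years:", X_train.ravel()[0], "to", X_train.ravel()[-1])
print("test years:", X_test.ravel().tolist())
```

Keeping `shuffle=False` matters for time-ordered data: a shuffled split would let the model see later years while being tested on earlier ones.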

Answer 1

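Edit: My answer is wrong; I used a classifier where a regressor was needed. I am leaving it up rather than deleting it because I am worried about being able to post further answers. Please do not use this answer.

Try this: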

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame([[i for i in range(2000,2018)], 
[23.0,27.5,46.0,56.0,64.8,71.2,80.2,98.0,113.0,155.8,414.0,2297.8,3628.4,16187.8,25197.8,42987.8,77555.5,130631.9]])


df = df.T
df.columns = ['Year', 'Values']
df['Year'] = df['Year'].astype(int)
df['Values'] = df['Values'].astype(int)

Your DataFrame:

X = df[['Year']]
y = df[['Values']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0, shuffle = False)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

clf = RandomForestClassifier(n_estimators=10)
clf.fit(X_train, y_train)


y_pred = clf.predict(X_test)

plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, clf.predict(X_train), color = 'blue')
plt.title('Years vs Values (training set)')
plt.xlabel('Years')

plt.xticks(rotation=90)
plt.ylabel('Values')
plt.show()
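
For reference, the regression counterpart of the estimator used above is `RandomForestRegressor`. The sketch below is an editorial illustration, not the answerer's code; the variable names and the choice of 2018 and 2025 as query years are assumptions. It also shows why a tree ensemble is a poor fit for forecasting: for any year past the training range the prediction is the same, and it cannot exceed the largest training target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

years = np.arange(2000, 2018).reshape(-1, 1)
values = np.array([23.0, 27.5, 46.0, 56.0, 64.8, 71.2, 80.2, 98.0, 113.0,
                   155.8, 414.0, 2297.8, 3628.4, 16187.8, 25197.8, 42987.8,
                   77555.5, 130631.9])

# Hold out the last two years, mirroring the chronological split in the question
X_train, y_train = years[:-2], values[:-2]

reg = RandomForestRegressor(n_estimators=10, random_state=0)
reg.fit(X_train, y_train)

# Both query years lie beyond every split threshold, so they land in the same leaves
future = np.array([[2018], [2025]])
future_pred = reg.predict(future)
print(future_pred)
```

So even with the correct estimator type, this approach would not extend the upward trend into future years.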

Answer 2

Try this. You can also print the predicted values. It forecasts 5 years ahead.

import numpy.polynomial.polynomial as poly
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

df = pd.DataFrame([[i for i in range(2000,2018)],
[23.0,27.5,46.0,56.0,64.8,71.2,80.2,98.0,113.0,155.8,414.0,2297.8,3628.4,16187.8,25197.8,42987.8,77555.5,130631.9]])
df = df.T
df.columns = ['Year', 'Values']
df['Year'] = df['Year'].astype(int)
df['Values'] = df['Values'].astype(float)  # keep the decimals; int would truncate 27.5 to 27
no_of_predictions = 5


X = np.array(df.Year, dtype = float)
y = np.array(df.Values, dtype = float)
Z = [2019,2020,2021,2022]
coefs = poly.polyfit(X, y, 4)
X_new = np.linspace(X[0], X[-1]+no_of_predictions, num=len(X)+no_of_predictions)
ffit = poly.polyval(X_new, coefs)
pred = poly.polyval(Z, coefs)
# Tabulate each forecast year next to its predicted value
predictions = pd.DataFrame({'Year': Z, 'Predicted': pred})
print(predictions)
plt.plot(X, y, 'ro', label="Original data")
plt.plot(X_new, ffit, label = "Fitted data")
plt.legend(loc='upper left')
plt.show()
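
Since the values grow roughly exponentially, one alternative worth trying is to fit a straight line to the logarithm of the values instead of a degree-4 polynomial to the raw values. This is a different technique from the one in this answer, sketched here under my own assumptions (the names `t`, `log_coefs`, `future_years` and the 2018 to 2022 horizon are illustrative).

```python
import numpy as np
import numpy.polynomial.polynomial as poly

years = np.arange(2000, 2018, dtype=float)
values = np.array([23.0, 27.5, 46.0, 56.0, 64.8, 71.2, 80.2, 98.0, 113.0,
                   155.8, 414.0, 2297.8, 3628.4, 16187.8, 25197.8, 42987.8,
                   77555.5, 130631.9])

# Work in years since 2000 so the x values stay small
t = years - years[0]

# Fit log(value) ~ c0 + c1 * t, i.e. value ~ exp(c0) * exp(c1 * t)
log_coefs = poly.polyfit(t, np.log(values), 1)

# Forecast by evaluating the line at future offsets and undoing the log
future_years = np.array([2018, 2019, 2020, 2021, 2022], dtype=float)
forecast = np.exp(poly.polyval(future_years - years[0], log_coefs))

for yr, val in zip(future_years, forecast):
    print(int(yr), round(float(val), 1))
```

An exponential model always yields positive, monotonically growing forecasts here, whereas a high-degree polynomial can swing sharply once you leave the fitted range. Whether exponential growth is a reasonable assumption for your data is something only you can judge.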

Answer 3

In the meantime, I also tried this:

import numpy.polynomial.polynomial as poly
X = np.array(df.Years, dtype = float)
y = np.array(df.Values, dtype = float)
coefs = poly.polyfit(X, y, 4)
X_new = np.linspace(X[0], X[-1], num=17)
ffit = poly.polyval(X_new, coefs)
plt.plot(X, y, 'ro', label="Original data")
plt.plot(X_new, ffit, label = "Fitted data")
plt.legend(loc='upper left')
plt.show()

It does fit very well. But now I am not clear on how to use this fit to predict values for the next five years.
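
One way to get those future values, as far as I can tell, is to evaluate the fitted coefficients at the new years with `poly.polyval`. The sketch below is a minimal illustration: it subtracts 2000 from the years before fitting, which is an assumption of mine to keep the least-squares problem numerically well behaved, so the same shift has to be applied to the future years. The names `t`, `fitted`, `future_years` and `forecast` are illustrative.

```python
import numpy as np
import numpy.polynomial.polynomial as poly

years = np.arange(2000, 2018, dtype=float)
values = np.array([23.0, 27.5, 46.0, 56.0, 64.8, 71.2, 80.2, 98.0, 113.0,
                   155.8, 414.0, 2297.8, 3628.4, 16187.8, 25197.8, 42987.8,
                   77555.5, 130631.9])

# Shift the years so they run from 0 to 17 instead of 2000 to 2017
t = years - 2000.0
coefs = poly.polyfit(t, values, 4)

# Values of the fitted curve at the observed years
fitted = poly.polyval(t, coefs)

# The same coefficients evaluated at the next five years give the forecast
future_years = np.arange(2018, 2023, dtype=float)
forecast = poly.polyval(future_years - 2000.0, coefs)

for yr, val in zip(future_years, forecast):
    print(int(yr), round(float(val), 1))
```

Keep in mind that a degree-4 polynomial is only constrained inside the fitted range, so forecasts several years out should be treated as rough extrapolations.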

Fitting a non-linear univariate regression to time-series data in Python

3 answers: