Question

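I am new to scikit and need to learn how to predict a value from multiple columns of continuous data. I have several data columns containing continuous data, as shown below. (The column names are for reference only.)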

ColA, ColB, ColC, ColD, ColE
8.0     307.0   130.0   3504.0  12.0
15.0    350.0   165.0   3693.0  11.5
18.0    318.0   150.0   3436.0  11.0
16.0    304.0   150.0   3433.0  12.0
17.0    302.0   140.0   3449.0  10.5
15.0    429.0   198.0   4341.0  10.0
14.0    454.0   220.0   4354.0  9.0
14.0    440.0   215.0   4312.0  8.5
....
....

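What I need to do is build a model from the data above and use it to predict the value in ColA. I have only seen examples that classify the predicted value. Given any/all of the ColB, ColC, ColD, ColE values, how do I get the actual value?

Can anyone help me with how to do this in scikit?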

Answer 1

First, I converted the data into a csv file so that I could use pandas. The csv is here

Example:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

#Treat '?' as a missing value while reading, so that the columns stay numeric
df = pd.read_csv('data.csv', header=None, na_values='?')

#Fill the missing data (including the former '?' entries) with 0
df = df.fillna(0)

X = df.iloc[:,1:7]

#I assume that y is the first column of the dataset, as you said
y = df.iloc[:,0]

#I split the data X, y into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

#Convert pandas objects into numpy arrays (optional: scikit-learn also accepts DataFrames and Series directly)
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

#Create the model
model = LinearRegression()

#Fit the model using the training data
model.fit(X_train,y_train)

#Predict unseen data
y_predicted = model.predict(X_test)

#Coefficient of determination R^2 of the predictions on the test set
scores = model.score(X_test, y_test)

print(y_predicted)
print(scores)

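The first print shows the predicted values for the unseen features (X_test). These predicted values correspond to the first column of the dataset (ColA).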
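
To get an actual value for one new set of inputs, as the question asks, something like the following sketch may help. It fits on the eight sample rows from the question only (far too few for a real model), and `new_row` is a made-up observation used purely for illustration. Note that a linear regression fitted on four features needs all four values (ColB to ColE) to make a prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The eight sample rows from the question: ColA is the target, ColB..ColE are the features
data = np.array([
    [8.0, 307.0, 130.0, 3504.0, 12.0],
    [15.0, 350.0, 165.0, 3693.0, 11.5],
    [18.0, 318.0, 150.0, 3436.0, 11.0],
    [16.0, 304.0, 150.0, 3433.0, 12.0],
    [17.0, 302.0, 140.0, 3449.0, 10.5],
    [15.0, 429.0, 198.0, 4341.0, 10.0],
    [14.0, 454.0, 220.0, 4354.0, 9.0],
    [14.0, 440.0, 215.0, 4312.0, 8.5],
])
X, y = data[:, 1:], data[:, 0]

model = LinearRegression().fit(X, y)

# Hypothetical new observation (ColB, ColC, ColD, ColE); predict() expects a 2D array
new_row = np.array([[320.0, 150.0, 3500.0, 11.0]])
predicted_colA = model.predict(new_row)

print(predicted_colA)  # a one-element array holding the predicted ColA value
```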

The second print shows the coefficient of determination R^2 of the prediction.

More here
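
As a rough illustration of what that score means, R^2 can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses the eight sample rows from the question and scores the model on its own training data, which is only meant to show the arithmetic, not to evaluate the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# The eight sample rows from the question: ColA is the target, ColB..ColE are the features
data = np.array([
    [8.0, 307.0, 130.0, 3504.0, 12.0],
    [15.0, 350.0, 165.0, 3693.0, 11.5],
    [18.0, 318.0, 150.0, 3436.0, 11.0],
    [16.0, 304.0, 150.0, 3433.0, 12.0],
    [17.0, 302.0, 140.0, 3449.0, 10.5],
    [15.0, 429.0, 198.0, 4341.0, 10.0],
    [14.0, 454.0, 220.0, 4354.0, 9.0],
    [14.0, 440.0, 215.0, 4312.0, 8.5],
])
X, y = data[:, 1:], data[:, 0]

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_pred) ** 2)    # squared error left after fitting
ss_tot = np.sum((y - y.mean()) ** 2)  # squared error of always predicting the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)
print(model.score(X, y))    # same quantity, computed by the estimator
print(r2_score(y, y_pred))  # same quantity, computed by sklearn.metrics
```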

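P.S.: The problem you are trying to solve is too general.

First, you can use the StandardScaler from sklearn to scale the features (the X array). This is usually a good idea and can improve performance, but it is up to you. More details here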
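
One possible way to wire the scaler in is a pipeline, so the same scaling learned during fitting is applied again at prediction time. This is a minimal sketch on the eight sample rows from the question, not a tuned setup. For plain LinearRegression the predictions come out the same with or without scaling; scaling tends to matter more for regularized or distance-based estimators.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The eight sample rows from the question: ColA is the target, ColB..ColE are the features
data = np.array([
    [8.0, 307.0, 130.0, 3504.0, 12.0],
    [15.0, 350.0, 165.0, 3693.0, 11.5],
    [18.0, 318.0, 150.0, 3436.0, 11.0],
    [16.0, 304.0, 150.0, 3433.0, 12.0],
    [17.0, 302.0, 140.0, 3449.0, 10.5],
    [15.0, 429.0, 198.0, 4341.0, 10.0],
    [14.0, 454.0, 220.0, 4354.0, 9.0],
    [14.0, 440.0, 215.0, 4312.0, 8.5],
])
X, y = data[:, 1:], data[:, 0]

# Scale each feature to zero mean and unit variance, then fit the regression
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_model.fit(X, y)
pred_scaled = scaled_model.predict(X)

# Unscaled model, for comparison
plain_model = LinearRegression().fit(X, y)
pred_plain = plain_model.predict(X)

# Inspect the scaled features produced by the fitted scaler step
X_scaled = scaled_model.named_steps["standardscaler"].transform(X)
print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```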

Next, you can use other methods to split the data instead of train_test_split.
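
K-fold cross-validation is one such alternative. The sketch below assumes synthetic data generated with make_regression as a stand-in (it is not the data from the question, which has too few sample rows for meaningful folds); the number of folds and the seeds are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (NOT the data from the question): 100 rows, 4 features
X, y = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)

# 5-fold cross-validation: every row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print(scores)         # one R^2 value per fold
print(scores.mean())  # average R^2 across the folds
```

Averaging over folds gives a less split-dependent estimate than a single train/test split, at the cost of fitting the model several times.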

Finally, you can use other estimators instead of LinearRegression.

Hope this helps.
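
Because scikit-learn regressors share the same fit/predict interface, swapping the estimator is usually a small change. The sketch below tries Ridge and RandomForestRegressor as two arbitrary examples on the eight sample rows from the question; the hyperparameters and `new_row` are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

# The eight sample rows from the question: ColA is the target, ColB..ColE are the features
data = np.array([
    [8.0, 307.0, 130.0, 3504.0, 12.0],
    [15.0, 350.0, 165.0, 3693.0, 11.5],
    [18.0, 318.0, 150.0, 3436.0, 11.0],
    [16.0, 304.0, 150.0, 3433.0, 12.0],
    [17.0, 302.0, 140.0, 3449.0, 10.5],
    [15.0, 429.0, 198.0, 4341.0, 10.0],
    [14.0, 454.0, 220.0, 4354.0, 9.0],
    [14.0, 440.0, 215.0, 4312.0, 8.5],
])
X, y = data[:, 1:], data[:, 0]

# Hypothetical new observation (ColB, ColC, ColD, ColE)
new_row = np.array([[320.0, 150.0, 3500.0, 11.0]])

# Two alternative regressors; both are used exactly like LinearRegression
models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

predictions = {}
for name, estimator in models.items():
    estimator.fit(X, y)
    predictions[name] = estimator.predict(new_row)[0]
    print(name, predictions[name])
```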

Predicting values for continuous data with scikit
