Multiple Linear Regression Machine Learning in Python

Time: 2019-12-03 08:05:39

Tags: python-3.x machine-learning scikit-learn

I am trying to use multiple linear regression to evaluate an output based on certain inputs. I have trained on the data, and when I run the following code I get the correct expected values:

import pandas as pd

dataset = pd.read_excel('TEST.xlsx')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 5].values

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 0] = labelencoder.fit_transform(X[:, 0])  # 1ST COLUMN 

labelencoder1 = LabelEncoder()
X[:, 1] = labelencoder1.fit_transform(X[:, 1])  # 2ND COLUMN 

labelencoder2 = LabelEncoder()
X[:, 2] = labelencoder2.fit_transform(X[:, 2]) #  # 3RD COLUMN 

onehotencoder = OneHotEncoder(categorical_features = "all")
X = onehotencoder.fit_transform(X).toarray()

# Avoiding the Dummy Variable Trap
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)   # UP TO HERE IT WORKS AS EXPECTED

Now I am trying to use the same model to evaluate another set of input data, as shown below:

dataset1 = pd.read_excel('TEST1.xlsx')  # NEW SET OF INPUT RECORDS TO BE EVALUATED
X1 = dataset1.iloc[:, :-1].values
# Encoding categorical data
labelencoder3 = LabelEncoder()
X1[:, 0] = labelencoder3.fit_transform(X1[:, 0])

labelencoder4 = LabelEncoder()
X1[:, 1] = labelencoder4.fit_transform(X1[:, 1])

labelencoder5 = LabelEncoder()
X1[:, 2] = labelencoder5.fit_transform(X1[:, 2])

onehotencoder2 = OneHotEncoder(categorical_features = "all")
X1 = onehotencoder2.fit_transform(X1).toarray()
X1 = X1[:, 1:]
output = regressor.predict(X1) 

But when I run this code, I get the following error:


ValueError: shapes (6,13) and (390,) not aligned: 13 (dim 1) != 390 (dim 0)

It would be great if someone could help me resolve this issue.

1 Answer:

Answer 0 (score: 0):

Is the number of features the same between X and X1?
For example, if X contains five distinct values, then X transformed with OneHotEncoder has shape (n, 5), and regressor.fit(X_train, y_train) returns a regression object of the form y = b + a1*x1 + a2*x2 + ... + a5*x5.
If X1 contains ten distinct values, then X1 transformed with OneHotEncoder has shape (n, 10), which would require a regression object of the form y = b + a1*x1 + ... + a10*x10 to evaluate it. But your regressor was trained on data that produced only five features, so X1 with shape (n, 10) cannot be evaluated by y = b + a1*x1 + ... + a5*x5.
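To see this mismatch concretely in your case, you could compare the number of coefficients the fitted regressor expects with the number of columns produced for the new data (a minimal check using the variables from your snippets):

print(regressor.coef_.shape)   # number of features the model was trained on, (390,) in your traceback
print(X1.shape)                # shape of the newly encoded data, (6, 13) in your traceback

# predict() only works when the column counts match
assert X1.shape[1] == regressor.coef_.shape[0], "feature count mismatch"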
Also, I don't think the conversion to a dense array after onehotencoder.fit_transform() is necessary.
I am not sure whether my answer resolves your problem, but I hope it does.
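The answer does not spell out a fix, but one common way to keep the feature count consistent is to fit the encoders only on the training data and reuse their transform() method on the new data. A minimal sketch, assuming TEST1.xlsx has the same column layout as TEST.xlsx and contains no categories that were unseen during training (unseen categories would make transform() raise an error):

X1 = dataset1.iloc[:, :-1].values

# Reuse the encoders fitted on the training data instead of fitting new ones
X1[:, 0] = labelencoder.transform(X1[:, 0])
X1[:, 1] = labelencoder1.transform(X1[:, 1])
X1[:, 2] = labelencoder2.transform(X1[:, 2])

# Reuse the fitted OneHotEncoder so the column count matches the training data
X1 = onehotencoder.transform(X1).toarray()
X1 = X1[:, 1:]   # drop the same dummy-variable column as for X

output = regressor.predict(X1)   # X1 now has the same number of features as X_train

On scikit-learn 0.20 and later, the same idea is usually expressed by fitting a single OneHotEncoder (optionally with handle_unknown='ignore') or a ColumnTransformer once on the training data and reusing it for any new data.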