Python 2.7 / scikit-learn 0.17 linear regression error: ValueError: Found arrays with inconsistent numbers of samples: [1 343]

Asked: 2016-02-17 08:49:59

Tags: python-2.7 pandas scikit-learn

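I am reading a CSV file and trying to fit a linear regression model of df['LSTAT'] (x / feature) vs. df['MEDV'] (y / target). However, the error message "ValueError: Found arrays with inconsistent numbers of samples: [1 343]" keeps popping up at the model fitting stage.

I tried shaping/reshaping the data (not sure I did it correctly) and converting the pd.DataFrame to a NumPy array and to a list. None of these worked. Even after reading this post, I still don't quite understand the problem: sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit(). The script and error message are below.

Could any guru offer a solution with a detailed explanation? Thanks!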

import scipy.stats as stats
import pylab 
import numpy as np
import pandas as pd  # needed for pd.read_csv below
import matplotlib.pyplot as plt
import pylab as pl
import sklearn
from sklearn.cross_validation import train_test_split
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression


df=pd.read_csv("input.csv")


X_train1, X_test1, y_train1, y_test1 = train_test_split(df['LSTAT'],df['MEDV'],test_size=0.3,random_state=1)

lin=LinearRegression()

################## This line: " lin_train=lin.fit(X_train1,y_train1)" causes the trouble. 

lin_train=lin.fit(X_train1,y_train1)

################## The following lines just print and plot the results after fitting the linear regression

# The coefficients
print('Coefficients: \n', lin.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((lin.predict(X_test1) - y_test1) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % lin.score(X_test1, y_test1))

# Plot outputs
plt.scatter(X_test1, y_test1,  color='black')
plt.plot(X_test1, lin.predict(X_test1), color='blue',linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

Here are the warning and error messages:

Warning (from warnings module):
  File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 386
    DeprecationWarning)
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

Traceback (most recent call last):
  File "C:/Users/Pin-Chih/Google Drive/Real_estate_projects/test.py", line 36, in <module>
    lin_train=lin.fit(X_train1,y_train1)
  File "C:\Python27\Lib\site-packages\sklearn\linear_model\base.py", line 427, in fit
    y_numeric=True, multi_output=True)
  File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 520, in check_X_y
    check_consistent_length(X, y)
  File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 176, in check_consistent_length
    "%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [  1 343]

If I print out "X_train1":

X_train1:  
61     26.82
294    12.86
39     29.29
458     4.85
412     8.05
Name: LSTAT, dtype: float64

If I print out "y_train1":

y_train1:  
61     13.4
294    22.5
39     11.8
458    35.1
412    29.0
Name: MEDV, dtype: float64

1 answer:

Answer 0: (score: 1)

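Certainly not a guru, but I have run into a similar problem before: the model expects the X argument to have at least 2 dimensions, even if the second dimension is 1. The first thing I would try is replacing

lin_train=lin.fit(X_train1,y_train1)

with

lin_train=lin.fit(X_train1.values.reshape(-1, 1), y_train1)

That should give you data of shape (343, 1) rather than just (343,). Going through .values matters because X_train1 is a pandas Series, and newer pandas releases no longer provide Series.reshape. X_test1 needs the same reshape before it is passed to lin.predict and lin.score.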
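To make the shape issue concrete, here is a minimal sketch on synthetic data. The random LSTAT/MEDV values are a hypothetical stand-in for input.csv, not the asker's file, and the np.atleast_2d step is only an illustration of how a 1-D array can be read as a single row, which is consistent with the [1 343] in the error message.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for input.csv (hypothetical data, not the asker's file).
rng = np.random.RandomState(1)
lstat = rng.uniform(2.0, 35.0, size=343)
medv = 34.0 - 0.9 * lstat + rng.normal(0.0, 1.0, size=343)
df = pd.DataFrame({"LSTAT": lstat, "MEDV": medv})

x_1d = df["LSTAT"].values      # shape (343,): one-dimensional
X_2d = x_1d.reshape(-1, 1)     # shape (343, 1): 343 samples, 1 feature
y = df["MEDV"].values          # shape (343,): 1-D is fine for the target

# A 1-D array promoted to 2-D becomes a single row: 1 sample, 343 features.
row = np.atleast_2d(x_1d)      # shape (1, 343)

lin = LinearRegression()
lin.fit(X_2d, y)               # sample counts now agree: 343 and 343

print(x_1d.shape, row.shape, X_2d.shape)
print(lin.coef_.shape)         # one coefficient for the single feature
```

Selecting the column with double brackets, df[["LSTAT"]], is another way to get a two-dimensional object of shape (343, 1) without calling reshape.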