Question

Starting from a pandas DataFrame, d_train (774 rows):

The idea is to follow the example here to study the ridge coefficient path.

In that example, the variable types are as follows:

X, y, w = make_regression(n_samples=10, n_features=10, coef=True,
                          random_state=1, bias=3.5)
print X.shape, type(X), y.shape, type(y), w.shape, type(w)

>> (10, 10) <type 'numpy.ndarray'> (10,) <type 'numpy.ndarray'> (10,) <type'numpy.ndarray'>

To avoid the problem mentioned in this stackoverflow discussion, I converted everything to numpy arrays:

predictors = ['p1', 'p2', 'p3', 'p4']
target = ['target_bins']
X = d_train[predictors].as_matrix()
### X = np.transpose(d_train[predictors].as_matrix())
y = d_train['target_bins'].as_matrix()
w = numpy.full((774,), 3, dtype=float)
print X.shape, type(X), y.shape, type(y), w.shape, type(w)
>> (774, 4) <type 'numpy.ndarray'> y_shape: (774,) <type 'numpy.ndarray'>     w_shape: (774,) <type 'numpy.ndarray'>

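Then I ran (a) the exact code from the example, and (b) the same code with the parameters fit_intercept = True, normalize = True added to the Ridge call (my data is not scaled). Both give the same error message: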

my_ridge = Ridge()
coefs = []
errors = []
alphas = np.logspace(-6, 6, 200)

for a in alphas:
    my_ridge.set_params(alpha=a, fit_intercept = True, normalize = True)
    my_ridge.fit(X, y)
    coefs.append(my_ridge.coef_)
    errors.append(mean_squared_error(my_ridge.coef_, w))
>> ValueError: Found input variables with inconsistent numbers of samples: [4, 774]

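As the commented-out line in the code shows, I also tried the "same" code with a transposed X matrix. I also tried scaling the data before creating the X matrix. Both attempts gave the same error message.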

Finally, I did the same thing using 'RidgeClassifier' and managed to get a different error message.

>> Found input variables with inconsistent numbers of samples: [1, 774]

Question: I have no idea what is going on here. Can you help?

Using Python 2.7 on Canopy 1.7.4.3348 (64-bit), with scikit-learn 18.01-3 and pandas 0.19.2-2.

Thanks.

Answer 1

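You need as many weights in w as you have features, because one weight is learned per feature. In your code the weight vector has dimension 774 (the number of rows in the training dataset), which is why it does not work: the error is raised by mean_squared_error(my_ridge.coef_, w), which compares the 4 fitted coefficients against 774 values. Change the code as follows (4 weights instead of 774) and everything will work: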

w = np.full((4,), 3, dtype=float) # number of features = 4, namely p1, p2, p3, p4
print X.shape, type(X), y.shape, type(y), w.shape, type(w)
#(774L, 4L) <type 'numpy.ndarray'> (774L,) <type 'numpy.ndarray'> (4L,) <type 'numpy.ndarray'>
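The mismatch can be reproduced in isolation. The following is a minimal sketch in Python 3 syntax, assuming randomly generated arrays as stand-ins for d_train[predictors] and d_train['target_bins'] (the real data is not available here). It shows that Ridge.fit succeeds on the (774, 4) data and that the ValueError comes from the mean_squared_error call when w has one entry per row instead of one per feature.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Assumption: synthetic stand-ins with the same shapes as the question's data.
rng = np.random.RandomState(0)
X = rng.randn(774, 4)   # 774 samples, 4 features (p1..p4)
y = rng.randn(774)      # one target value per sample

ridge = Ridge(alpha=1.0).fit(X, y)   # fitting itself works fine
print(ridge.coef_.shape)             # one coefficient per feature

# One weight per row: the lengths (4 vs. 774) do not match.
w_bad = np.full((774,), 3, dtype=float)
try:
    mean_squared_error(ridge.coef_, w_bad)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# One weight per feature: the comparison is well defined.
w_good = np.full((X.shape[1],), 3, dtype=float)
err = mean_squared_error(ridge.coef_, w_good)
print(err)
```

Using X.shape[1] rather than a hard-coded 4 keeps w consistent if the list of predictors changes.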

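Now you can run the rest of the code from http://scikit-learn.org/stable/auto_examples/linear_model/plot_ridge_coeffs.html#sphx-glr-auto-examples-linear-model-plot-ridge-coeffs-py to see how the weights and the error vary with the regularization parameter alpha over the grid, and produce the coefficient-path and error plots.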
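For reference, the path computation can be sketched as below. This is an illustration in Python 3 syntax rather than the example's exact code: it assumes data generated with make_regression (774 samples, 4 features) so that w holds the true coefficients, and it omits the normalize argument, which is not available in recent scikit-learn releases. Note that the error curve is only meaningful when w contains the true coefficients; a constant vector such as np.full((4,), 3) makes the code run but does not measure recovery error.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Assumption: synthetic data with known true coefficients w (shape (4,)).
X, y, w = make_regression(n_samples=774, n_features=4, coef=True,
                          random_state=1, bias=3.5)

alphas = np.logspace(-6, 6, 200)   # regularization strengths to sweep
ridge = Ridge()
coefs = []
errors = []

for a in alphas:
    ridge.set_params(alpha=a)
    ridge.fit(X, y)
    coefs.append(ridge.coef_)                          # shape (4,)
    errors.append(mean_squared_error(ridge.coef_, w))  # 4 vs. 4 values

coefs = np.asarray(coefs)   # one row per alpha, one column per feature
print(coefs.shape)
```

Plotting coefs and errors against alphas on a logarithmic x-axis gives the two panels of the coefficient-path figure.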

Coefficient path for ridge regression in scikit-learn
