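First of all, I'm new to machine learning.
I'm trying to predict the prices of used cars. Each car has a make and a model, so I used MultiLabelBinarizer to build a sparse matrix for the categorical attributes. Here is the code: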
from sklearn.preprocessing import MultiLabelBinarizer
encoder = MultiLabelBinarizer()
make_cat_1hot = encoder.fit_transform(make_cat)
model_cat_1hot = encoder.fit_transform(model_cat)
type_cat_1hot = encoder.fit_transform(type_cat)
print(type(make_cat_1hot))
carInfoModHot = carsInfoMod.copy()
carInfoModHot["makeHot"] = make_cat_1hot.tolist()
carInfoModHot["modelHot"] = model_cat_1hot.tolist()
carInfoModHot["typeHot"] = type_cat_1hot.tolist()
doors  km     make      year  makeHot                       modelHot
5.0    78779  Mercedes  2012  [0, 0, 0, 0, 1, 0, 0, 0, ...  [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, ...
5.0    25463  Bmw       2015  [0, 1, 0, 0, 0, 0, 0, ...     [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, ...
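One thing worth checking in the output above is how many 1s each row contains. The sketch below assumes that make_cat holds one plain string per car, which the post does not confirm; the `makes` list is a hypothetical stand-in. Under that assumption, MultiLabelBinarizer treats each string as a collection of characters, while LabelBinarizer gives one column per distinct make.

```python
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

# Hypothetical stand-in for make_cat: one plain string per car.
makes = ["Bmw", "Audi", "Bmw", "Mercedes"]

# MultiLabelBinarizer iterates over each sample, so a plain string is
# treated as a collection of single-character labels.
per_char = MultiLabelBinarizer().fit_transform(makes)

# LabelBinarizer treats each string as one label: one column per distinct make.
label_encoder = LabelBinarizer()
per_value = label_encoder.fit_transform(makes)

print(per_char.shape)          # columns correspond to distinct characters
print(per_value.shape)         # columns correspond to distinct makes
print(label_encoder.classes_)
```

If make_cat instead holds a list with a single string per car, MultiLabelBinarizer would also produce one column per make, so this only matters for plain-string input.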
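Then I use it to make predictions with linear regression and compute the mean squared error:
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
lr = linear_model.LinearRegression()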
carsInfoTrainHot = carInfoModHot.drop(["price"], axis=1) # drop labels for training set
df1 = carsInfoTrainHot.iloc[:30000, :]
carsLabels1 = carsInfoMod.iloc[:30000, 3]
print(carsInfoTrainHot.head())
df2 = carsInfoTrainHot.iloc[30001:60000, :]
carsLabels2 = carsInfoMod.iloc[30001:60000, 3]
df3 = carsInfoTrainHot.iloc[60001:, :]
carsLabels3 = carsInfoMod.iloc[60001:, 3]
lr.fit(df1, carsLabels1)
print(carsInfoTrainHot.shape)
carPrediction = lr.predict(df2)
lin_mse = mean_squared_error(carsLabels2, carPrediction)
lin_rmse = np.sqrt(lin_mse)
But I get this error:
ValueError                                Traceback (most recent call last)
 in ()
     12 carsLabels3 = carsInfoMod.iloc[60001:, 3]
     13
---> 14 lr.fit(df1, carsLabels1)
     15 print(carsInfoTrainHot.shape)
     16 carPrediction = lr.predict(df2)

/home/vagrant/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py in fit(self, X, y, sample_weight)
    510         n_jobs_ = self.n_jobs
    511         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 512                          y_numeric=True, multi_output=True)
    513
    514         if sample_weight is not None and np.atleast_1d(sample_weight).ndim > 1:

/home/vagrant/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    519     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    520                     ensure_2d, allow_nd, ensure_min_samples,
--> 521                     ensure_min_features, warn_on_dtype, estimator)
    522     if multi_output:
    523         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

/home/vagrant/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    400         # make sure we actually converted to numeric:
    401         if dtype_numeric and array.dtype.kind == "O":
--> 402             array = array.astype(np.float64)
    403         if not allow_nd and array.ndim >= 3:
    404             raise ValueError("Found array with dim %d. %s expected <= 2."

ValueError: setting an array element with a sequence.
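The failing step in the traceback is the conversion of an object array to float64. A minimal sketch of that step, using a hypothetical two-row frame shaped like carInfoModHot (the values are made up), seems to reproduce the same kind of failure:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame shaped like carInfoModHot: a numeric column
# plus a column whose cells are Python lists.
frame = pd.DataFrame({
    "km": [78779, 25463],
    "makeHot": [[0, 0, 1], [0, 1, 0]],
})

print(frame.dtypes)  # makeHot is stored with dtype object

# 2-D object array; each makeHot cell is still a whole list.
as_array = frame.to_numpy()

try:
    as_array.astype(np.float64)
    conversion_failed = False
except ValueError as err:
    # NumPy cannot turn a list-valued cell into a single float.
    conversion_failed = True
    print("ValueError:", err)
```

Each list stays a single cell instead of being spread over several numeric columns, which is why the cast cannot succeed.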
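As far as I understand, I'm inserting an array into a categorical attribute, but how else can I turn the categorical values into a sparse matrix?
Thanks.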
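One possible direction, sketched here under assumptions rather than as the definitive fix: keep the encoded categories as a separate sparse matrix instead of storing lists in DataFrame cells, then stack it next to the numeric columns. The `cars` frame and its values are hypothetical, and the sketch assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string columns directly (the traceback suggests an older version may be installed).

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of carsInfoMod; the values are made up.
cars = pd.DataFrame({
    "km": [78779, 25463, 120000, 40000],
    "year": [2012, 2015, 2008, 2014],
    "make": ["Mercedes", "Bmw", "Audi", "Bmw"],
    "model": ["C200", "320", "A4", "118"],
    "price": [15000.0, 22000.0, 6000.0, 14000.0],
})

# OneHotEncoder returns a SciPy sparse matrix by default, with one column
# per distinct value of each categorical column.
encoder = OneHotEncoder(handle_unknown="ignore")
categorical = encoder.fit_transform(cars[["make", "model"]])

# Put the numeric columns next to the one-hot columns in one sparse matrix.
numeric = sparse.csr_matrix(cars[["km", "year"]].to_numpy(dtype=np.float64))
features = sparse.hstack([numeric, categorical], format="csr")

lr = LinearRegression()
lr.fit(features, cars["price"])
predictions = lr.predict(features)
print(features.shape, predictions.shape)
```

For a held-out split, the fitted encoder would be applied with `encoder.transform(...)` rather than `fit_transform`, so that the columns line up with the training matrix.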