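Categorical attributes as a sparse matrix

Asked: 2017-09-25 10:39:24

Tags: machine-learning scikit-learn sparse-matrix

First of all, I am new to machine learning.

I am trying to predict the price of used cars. Each car has a make and a model, so I use MultiLabelBinarizer to build a sparse matrix for the categorical attributes. Here is the code: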

from sklearn.preprocessing import MultiLabelBinarizer
encoder = MultiLabelBinarizer()
make_cat_1hot = encoder.fit_transform(make_cat)
model_cat_1hot = encoder.fit_transform(model_cat)
type_cat_1hot = encoder.fit_transform(type_cat)

print(type(make_cat_1hot))
carInfoModHot = carsInfoMod.copy()
carInfoModHot["makeHot"] = make_cat_1hot.tolist()
carInfoModHot["modelHot"] = model_cat_1hot.tolist()
carInfoModHot["typeHot"] = type_cat_1hot.tolist()



doors   km      make        year    makeHot                         modelHot
5.0     78779   Mercedes    2012    [0, 0, 0, 0, 1, 0, 0, 0, ...    [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, ...
5.0     25463   Bmw         2015    [0, 1, 0, 0, 0, 0, 0, ...       [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, ...
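
One thing that may be worth checking, sketched here under assumptions: the modelHot rows in the output above contain several 1s each, which is what MultiLabelBinarizer produces when every sample is a plain string, because it iterates over the string and treats each character as a separate label. It also returns a dense NumPy array unless sparse_output=True is passed. The list `makes` below is a hypothetical stand-in for make_cat, whose contents are not shown in the question.

```python
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

# Hypothetical stand-in for make_cat; the real data is not shown in the question.
makes = ["Mercedes", "Bmw", "Audi"]

# MultiLabelBinarizer iterates over each sample, so a plain string
# is split into its characters and every character becomes a class.
mlb = MultiLabelBinarizer()
by_char = mlb.fit_transform(makes)
print(mlb.classes_)      # single characters, not makes
print(by_char.shape)     # one column per distinct character

# LabelBinarizer treats each string as one label and can return
# a SciPy sparse matrix when sparse_output=True.
lb = LabelBinarizer(sparse_output=True)
by_make = lb.fit_transform(makes)
print(lb.classes_)       # one class per make
print(by_make.shape)     # one column per make
```

If the real make_cat is already a list of one-element lists, the first half of this sketch does not apply.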

Then I use it to make predictions with linear regression and compute the mean squared error:

import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

lr = linear_model.LinearRegression()

carsInfoTrainHot = carInfoModHot.drop(["price"], axis=1) # drop labels for training set

df1 = carsInfoTrainHot.iloc[:30000, :]
carsLabels1 = carsInfoMod.iloc[:30000, 3]
print(carsInfoTrainHot.head())
df2 = carsInfoTrainHot.iloc[30001:60000, :]
carsLabels2 = carsInfoMod.iloc[30001:60000, 3]
df3 = carsInfoTrainHot.iloc[60001:, :]
carsLabels3 = carsInfoMod.iloc[60001:, 3]

lr.fit(df1, carsLabels1) 
print(carsInfoTrainHot.shape)
carPrediction = lr.predict(df2)

lin_mse = mean_squared_error(carsLabels2, carPrediction)

lin_rmse = np.sqrt(lin_mse)

But I get this error:

        

ValueError                                Traceback (most recent call last)
<ipython-input-...> in <module>()
     12 carsLabels3 = carsInfoMod.iloc[60001:, 3]
     13
---> 14 lr.fit(df1, carsLabels1)
     15 print(carsInfoTrainHot.shape)
     16 carPrediction = lr.predict(df2)

/home/vagrant/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py in fit(self, X, y, sample_weight)
    510         n_jobs_ = self.n_jobs
    511         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 512                          y_numeric=True, multi_output=True)
    513
    514         if sample_weight is not None and np.atleast_1d(sample_weight).ndim > 1:

/home/vagrant/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    519     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    520                     ensure_2d, allow_nd, ensure_min_samples,
--> 521                     ensure_min_features, warn_on_dtype, estimator)
    522     if multi_output:
    523         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

/home/vagrant/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    400         # make sure we actually converted to numeric:
    401         if dtype_numeric and array.dtype.kind == "O":
--> 402             array = array.astype(np.float64)
    403         if not allow_nd and array.ndim >= 3:
    404             raise ValueError("Found array with dim %d. %s expected <= 2."

ValueError: setting an array element with a sequence.
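
The failure can be reproduced on a tiny frame, which may help to see where it comes from. This is a minimal sketch with made-up values, not the original data: a column whose cells are Python lists has dtype object, and converting such a frame to a float array fails. The `astype` call only approximates what scikit-learn's input validation does at line 402 of the traceback.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame shaped like the one in the question.
df = pd.DataFrame({"km": [78779, 25463]})
df["makeHot"] = [[0, 1, 0], [1, 0, 0]]   # each cell holds a Python list

print(df.dtypes)   # makeHot is stored with dtype object

try:
    # Roughly the conversion scikit-learn attempts on object-dtype input.
    df.to_numpy().astype(np.float64)
    message = None
except (TypeError, ValueError) as exc:
    message = str(exc)

print(message)
```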

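As far as I understand, I am putting an array into each cell of the categorical columns, but how can I turn the categorical values into a sparse matrix instead?

Thanks.

0 answers:

No answers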
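
One possible approach, given only as a sketch under assumptions: keep the one-hot block as a separate SciPy sparse matrix instead of storing lists in DataFrame cells, and stack it next to the numeric columns with `scipy.sparse.hstack`. The toy data below is invented, the column names follow the question, and passing string columns directly to OneHotEncoder assumes scikit-learn 0.20 or later.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; column names follow the question, values are made up.
cars = pd.DataFrame({
    "doors": [5.0, 5.0, 3.0, 5.0],
    "km": [78779, 25463, 120000, 43000],
    "year": [2012, 2015, 2008, 2014],
    "make": ["Mercedes", "Bmw", "Audi", "Bmw"],
    "model": ["C200", "320d", "A3", "118i"],
    "price": [15000.0, 24000.0, 6000.0, 17000.0],
})

# One-hot encode the string columns; the result is a SciPy sparse matrix by default.
encoder = OneHotEncoder(handle_unknown="ignore")
cat_sparse = encoder.fit_transform(cars[["make", "model"]])

# Put the numeric columns next to the one-hot block, keeping everything sparse.
num_sparse = sparse.csr_matrix(cars[["doors", "km", "year"]].to_numpy(dtype=np.float64))
X = sparse.hstack([num_sparse, cat_sparse], format="csr")
y = cars["price"].to_numpy()

# LinearRegression accepts sparse input, so no list-valued cells are involved.
lr = LinearRegression()
lr.fit(X, y)
predictions = lr.predict(X)
print(X.shape, predictions.shape)
```

With real data the encoder would be fitted on the training rows only and then applied to the held-out rows with `encoder.transform`, so both matrices share the same columns.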