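I'm new to machine learning and I'm experimenting with a public wine dataset. I ran into an error that I can't find a solution for.
Here is the model I'm trying to use: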
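import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X = data_all[['country', 'description', 'price', 'province', 'variety']]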
y = data_all['points']
# Vectorizing Description column (text analysis)
vectorizerDesc = CountVectorizer()
descriptions = X['description']
vectorizerDesc.fit(descriptions)
vectorizedDesc = vectorizerDesc.transform(X['description'])
X['description'] = vectorizedDesc
# Categorizing other string columns
X = pd.get_dummies(X, columns=['country', 'province', 'variety'])
# Generating train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
# Multinomial Naive Bayes
nb = MultinomialNB()
nb.fit(X_train, y_train)
This is what X looks like before the call to train_test_split:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 83945 entries, 25 to 150929
Columns: 837 entries, description to variety_Zweigelt
dtypes: float64(1), object(1), uint8(835)
The last line (nb.fit) gives me an error:
ValueError Traceback (most recent call last)
<ipython-input-197-9d40e4624ff6> in <module>()
3 # Multinomial Naive Bayes is a specialised version of Naive Bayes designed more for text documents
4 nb = MultinomialNB()
----> 5 nb.fit(X_train, y_train)
/opt/conda/lib/python3.6/site-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
577 Returns self.
578 """
--> 579 X, y = check_X_y(X, y, 'csr')
580 _, n_features = X.shape
581
/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
571 X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
572 ensure_2d, allow_nd, ensure_min_samples,
--> 573 ensure_min_features, warn_on_dtype, estimator)
574 if multi_output:
575 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
446 # make sure we actually converted to numeric:
447 if dtype_numeric and array.dtype.kind == "O":
--> 448 array = array.astype(np.float64)
449 if not allow_nd and array.ndim >= 3:
450 raise ValueError("Found array with dim %d. %s expected <= 2."
ValueError: setting an array element with a sequence.
Do you know how I can combine my vectorized text features with the rest of the dataset (country, etc.) in a Multinomial NB model?
Thanks in advance :)
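
For reference, one possible way to combine the two kinds of features is to keep the CountVectorizer output as a sparse matrix and stack it next to the one-hot columns with scipy.sparse.hstack, rather than assigning it to a single DataFrame column. The sketch below is only an illustration under assumptions: the `toy` DataFrame and its values are made up as a stand-in for `data_all`, and it assumes `price` has no missing or negative values (MultinomialNB rejects both).

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for data_all; the values are invented for illustration.
toy = pd.DataFrame({
    "country": ["US", "France", "US", "Italy"],
    "description": [
        "ripe cherry and oak",
        "crisp citrus and mineral",
        "dark fruit with oak",
        "bright cherry and herbs",
    ],
    "price": [20.0, 35.0, 15.0, 28.0],
    "points": [88, 91, 85, 90],
})

# Text column -> sparse bag-of-words counts (one column per vocabulary word).
vec = CountVectorizer()
text_features = vec.fit_transform(toy["description"])

# Categorical column -> one-hot columns; price stays as a numeric column.
other = pd.get_dummies(toy[["country", "price"]], columns=["country"])
other_features = csr_matrix(other.to_numpy(dtype="float64"))

# Stack both blocks side by side into a single sparse feature matrix.
X_combined = hstack([text_features, other_features]).tocsr()

# MultinomialNB accepts sparse input; points are treated as class labels here.
nb = MultinomialNB()
nb.fit(X_combined, toy["points"])
preds = nb.predict(X_combined)
```

On real data the train/test split would normally come first, with the vectorizer fitted on the training rows only and then applied to the test rows with transform.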