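Updated
I am working on text classification with machine learning and am using an SVC with a linear kernel. The whole code works except for the last line, print(svm_model_linear.predict_proba(test)). I am building a classifier with 3 classes: cycling, football, and badminton. I have some people's Facebook statuses labeled with these categories, and I have trained and tested the classifier using train_test_split. I also have some statuses that are not labeled, and I want to classify them, but the last line of code gives me an error.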
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=700)
X = cv.fit_transform(corpus).toarray()
print(X)
y = dataset.iloc[:, 1].values
print(y)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
from sklearn.svm import SVC
svm_model_linear = SVC(kernel='linear', C=1, probability=True).fit(X_train, y_train)
svm_predictions = svm_model_linear.predict(X_test)
# model accuracy for X_test
accuracy = svm_model_linear.score(X_test, y_test)
# creating a confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, svm_predictions)
Classification of the unlabeled data starts here
data = pd.read_csv('sentence.csv', delimiter='\t', quoting=3)
test = []
for j in range(0, 5):
    review = re.sub('[^a-zA-Z]', ' ', data['Sentence'][j])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    test.append(review)
pred = cv.fit_transform(test).toarray()
print(svm_model_linear.predict_proba(test))
Error
print (svm_model_linear.predict_proba(test))
Traceback (most recent call last):
File "<ipython-input-7-5fa676a0fc00>", line 1, in <module>
print (svm_model_linear.predict_proba(test))
File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/svm/base.py", line 594, in _predict_proba
X = self._validate_for_predict(X)
File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/svm/base.py", line 439, in _validate_for_predict
X = check_array(X, accept_sparse='csr', dtype=np.float64, order="C")
File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 402, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: X.shape[1] = 15 should be equal to 700, the number of features at training time
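Answer 0 (score: 2)
Scikit-learn estimators do not work on strings, only on numerical data. Your training part completed successfully because you used CountVectorizer to convert the corpus from strings to numbers. You did not do that for the test data.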
You need to call cv.transform(test) on your test data so that it has the same form as the X used to train the model. Only then will the prediction succeed and make sense.
Also, make sure to use the same cv object that converted the original training corpus to numerical format.
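The shape mismatch reported in the traceback can be reproduced on a toy corpus. The sketch below is only an illustration, not the asker's data: the sentences and the names train_corpus, new_texts, and X_wrong are made up, and it assumes a recent scikit-learn on Python 3. It shows that transform() on an already fitted vectorizer keeps the training-time column count, while fitting on the new text alone learns a different, narrower vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the training corpus and the unlabeled statuses.
train_corpus = [
    "rode my bike up the hill",
    "great football match tonight",
    "badminton doubles at the club",
]
new_texts = ["football tonight"]

cv = CountVectorizer()
X_train = cv.fit_transform(train_corpus).toarray()  # learns the vocabulary

# Reusing the fitted vectorizer keeps the training-time columns.
X_new = cv.transform(new_texts).toarray()

# Fitting a separate vectorizer on the new text alone learns a different vocabulary,
# so the column count no longer matches what the model was trained on.
X_wrong = CountVectorizer().fit_transform(new_texts).toarray()

print(X_train.shape, X_new.shape, X_wrong.shape)
```

Only X_new has the same number of columns as X_train, so only X_new can be passed to a model trained on X_train.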
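Update
You should not call fit_transform() on the test data; always call only transform(), as I suggested above. What you are currently doing is:
pred = cv.fit_transform(test).toarray()
This forgets the previous training and re-fits the count vectorizer, which changes the shape of pred. Change it to:
pred = cv.transform(test).toarray()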
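Putting the pieces together, the whole flow might look like the sketch below. It is a minimal example under stated assumptions: the labeled statuses, the unlabeled list, and the name clf are invented for illustration, the stemming and stop-word steps from the question are omitted, and a recent scikit-learn on Python 3 is assumed. The vectorizer is fitted once on the training text, and the transformed array, not the raw list of strings, is what goes into predict_proba().

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical labeled statuses; the real ones would come from the asker's dataset.
train_texts = [
    "long bike ride along the river",
    "new bike tyres and a fast ride",
    "cycling uphill on my bike",
    "bike ride to work today",
    "football match ended in a draw",
    "scored a goal in the football game",
    "football training on the pitch",
    "watching the football final",
    "badminton smash with a new racket",
    "lost the shuttlecock during badminton",
    "badminton doubles at the court",
    "bought a badminton racket",
]
train_labels = ["cycling"] * 4 + ["football"] * 4 + ["badminton"] * 4

# Hypothetical unlabeled statuses to classify.
unlabeled = ["great bike ride this morning", "football game tonight"]

cv = CountVectorizer(max_features=700)
X_train = cv.fit_transform(train_texts).toarray()  # fit only on the training text

clf = SVC(kernel="linear", C=1, probability=True, random_state=0)
clf.fit(X_train, train_labels)

# transform(), not fit_transform(): reuse the vocabulary learned during training.
pred = cv.transform(unlabeled).toarray()
proba = clf.predict_proba(pred)  # one row per status, one column per class

print(clf.classes_)
print(proba)
```

The columns of proba follow the order of clf.classes_. With a training set this small the probability estimates themselves are not reliable; the point is only that the shapes line up and the call succeeds.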