ValueError: X.shape[1] = 15 should be equal to 700, the number of features at training time

Time: 2017-11-03 11:41:45

Tags: python machine-learning scikit-learn

Updated

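I am working on machine-learning text classification and am using an SVC with a linear kernel. The whole code works except for the last line, print(svm_model_linear.predict_proba(test)). I am building a classifier with 3 classes: cycling, football, and badminton. I have some people's Facebook statuses that are labeled with these categories. I trained the classifier and tested it using train_test_split. Beyond that, I have some unlabeled statuses that I want to classify, but the last line of code gives me an error.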

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 700)
X = cv.fit_transform(corpus).toarray()
print X
y = dataset.iloc[:, 1].values
print y

# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 
0.20, random_state = 0)


from sklearn.svm import SVC
svm_model_linear = SVC(kernel ='linear', C = 1, 
probability=True).fit(X_train, y_train)
svm_predictions = svm_model_linear.predict(X_test)



# Model accuracy on X_test
accuracy = svm_model_linear.score(X_test, y_test)
# Creating a confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, svm_predictions)

Classification of the unlabeled data starts here:

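import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

data = pd.read_csv('sentence.csv', delimiter = '\t', quoting = 3)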

test = []
for j in range(0, 5):
    review = re.sub('[^a-zA-Z]', ' ', data['Sentence'][j])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in 
    set(stopwords.words('english'))]
    review = ' '.join(review)
    test.append(review)
pred = cv.fit_transform(test).toarray()
print (svm_model_linear.predict_proba(test))

Error

print (svm_model_linear.predict_proba(test))

Traceback (most recent call last):

  File "<ipython-input-7-5fa676a0fc00>", line 1, in <module>
print (svm_model_linear.predict_proba(test))

  File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/svm/base.py", line 594, in _predict_proba
X = self._validate_for_predict(X)

  File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/svm/base.py", line 439, in _validate_for_predict
X = check_array(X, accept_sparse='csr', dtype=np.float64, order="C")

  File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 402, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)

ValueError: X.shape[1] = 15 should be equal to 700, the number of features at training time

1 answer:

Answer 0 (score: 2)

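Scikit-learn estimators do not work on strings, only on numeric data. Your training part completed successfully because you used CountVectorizer to convert the corpus from strings to numbers. You did not do that for the test data.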

You need to call cv.transform(test) on your test data so that it has the same form as the X used to train the model. Only then will prediction succeed and make sense.

Also, make sure to use the same cv object that you used to convert the original training corpus into numeric format.
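To illustrate why the same fitted vectorizer matters, here is a minimal sketch. The texts in train_texts and new_texts are made up for illustration and are not from the question; the point is only that transform() reuses the vocabulary learned by fit_transform(), while a fresh fit learns a different one with a different number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the labeled corpus and the
# unlabeled statuses (not taken from the question).
train_texts = [
    "rode my bike up the hill",
    "football match tonight",
    "badminton with friends",
    "new bike tyres",
]
new_texts = ["bike ride before the football game"]

cv = CountVectorizer()
X_train = cv.fit_transform(train_texts)  # learns the vocabulary from the training texts
X_new = cv.transform(new_texts)          # reuses that vocabulary, so the columns line up

# Fitting again on the new texts learns a different, smaller vocabulary.
# A separate vectorizer is used here so that cv itself stays intact.
refit = CountVectorizer().fit_transform(new_texts)

print(X_train.shape[1])  # number of features seen at training time
print(X_new.shape[1])    # same number of features
print(refit.shape[1])    # a different number of features
```

Words in new_texts that were not in the training vocabulary are simply ignored by transform(), which is what keeps the feature count fixed.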

Update

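You should not call fit_transform() on the test data; always call only transform(), as I suggested above. What you are currently doing is:

pred = cv.fit_transform(test).toarray()

This forgets the previous training and re-fits the count vectorizer, which changes the shape of pred. Change it to:

pred = cv.transform(test).toarray()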
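Putting the pieces together, a rough end-to-end sketch might look like the following. The corpus, labels, and unlabeled lists are invented stand-ins for the asker's data, and the stemming and stop-word preprocessing is left out for brevity. Note that the transformed array pred, not the list of raw strings, is what gets passed to predict_proba.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical labeled statuses (assumption: not the asker's real data).
corpus = [
    "long bike ride this morning", "fixed my bike chain",
    "cycling to work on my bike", "new cycling helmet for the ride",
    "football match with the team", "scored a goal in football",
    "watching the football final", "football training after work",
    "badminton doubles tonight", "bought a badminton racket",
    "badminton court is booked", "lost the badminton game",
]
labels = ["cycling"] * 4 + ["football"] * 4 + ["badminton"] * 4

# Fit the vectorizer once, on the training corpus only.
cv = CountVectorizer(max_features=700)
X = cv.fit_transform(corpus).toarray()

clf = SVC(kernel="linear", C=1, probability=True, random_state=0).fit(X, labels)

# Hypothetical unlabeled statuses.
unlabeled = ["bike ride after the football match", "badminton racket broke"]

# transform() only: same columns as X, so the shapes match at prediction time.
pred = cv.transform(unlabeled).toarray()
proba = clf.predict_proba(pred)

print(clf.classes_)  # column order of the probabilities
print(proba.shape)   # one row per status, one column per class
```

With so few samples the probabilities themselves are not meaningful; the sketch is only meant to show that the feature counts agree and the call goes through.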