How to do K-NN on a bag of words

Time: 2018-12-15 18:12:19

Tags: python text-mining

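I have a training set and a test set (of equal size). I have built the bag-of-words model and am trying to run k-nearest neighbors on it, but I'm not sure how to do the fit.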

Bag-of-words model:

from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer(max_features=100, stop_words='english')

bow = bow_vectorizer.fit(TrainData)
print(bow_vectorizer.vocabulary_)
bowTrain = bow_vectorizer.fit_transform(TrainData)
bowTest = bow_vectorizer.fit_transform(TestData)

Attempting KNN on the bag-of-words model, but I'm not sure what to pass as the second argument to knn.fit:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(bowTrain, ???? )
predict = knn.predict(bowTest[0:5000])

1 answer:

Answer 0: (score: 0)

from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer(max_features=100, stop_words='english')

X_train = TrainData
#y_train = your array of labels goes here
bowVect = bow_vectorizer.fit(X_train)

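You should use the same fitted vectorizer for both sets, because the vocabulary may change otherwise. Fit it on the training data only, then call transform (not fit_transform) on both the training and the test data so that the columns line up.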

bowTrain = bowVect.transform(X_train)
bowTest = bowVect.transform(TestData)
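
To illustrate why the vectorizer should not be re-fitted on the test set, here is a minimal sketch. The toy documents and the variable names (train_docs, test_docs, shared, refit) are made up for illustration and are not from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, made up for illustration only.
train_docs = ["the cat sat on the mat", "dogs chase cats"]
test_docs = ["birds fly south", "the cat flies"]

# Fit once on the training documents, then reuse the fitted
# vectorizer for both sets so the columns mean the same words.
shared = CountVectorizer()
train_matrix = shared.fit_transform(train_docs)
test_matrix = shared.transform(test_docs)

# Re-fitting on the test documents learns a different vocabulary,
# so its columns no longer correspond to the training columns.
refit = CountVectorizer()
refit_test_matrix = refit.fit_transform(test_docs)

same_columns = train_matrix.shape[1] == test_matrix.shape[1]
vocab_changed = shared.vocabulary_ != refit.vocabulary_
print("shared vectorizer keeps column count:", same_columns)
print("re-fitting changed the vocabulary:", vocab_changed)
```

Words in the test set that were never seen during fitting are simply ignored by transform, which is usually what you want for a classifier trained on the training vocabulary.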

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(bowTrain, y_train )
predict = knn.predict(bowTest[0:5000])
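
Putting the pieces together, here is a self-contained sketch of the whole pipeline. The texts and the "pos"/"neg" labels are hypothetical stand-ins for TrainData, TestData and y_train, which the question does not show. The second argument to knn.fit is one label per training document.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data; replace with your own documents and labels.
train_texts = [
    "great movie loved it",
    "wonderful film great acting",
    "loved the wonderful story",
    "terrible movie hated it",
    "awful film terrible acting",
    "hated the awful story",
]
# One label per training document, in the same order as train_texts.
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
test_texts = ["great wonderful film", "terrible awful movie"]

# Learn the vocabulary from the training texts only.
vectorizer = CountVectorizer(max_features=100, stop_words="english")
X_train_bow = vectorizer.fit_transform(train_texts)
# Reuse the fitted vocabulary for the test texts.
X_test_bow = vectorizer.transform(test_texts)

# Fit on (features, labels), then predict labels for the test features.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_bow, train_labels)
predicted = knn.predict(X_test_bow)
print(predicted)
```

If you also have labels for the test set, knn.score(X_test_bow, test_labels) returns the mean accuracy on it.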