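When using nearest neighbors (scikit) for text classification, a sample is sometimes not similar to any class. When this happens, the scikit algorithm returns a distance of 1 and seems to pick an arbitrary class (the same within a single run, but it sometimes changes when I run it again). It would be helpful to get a specific value (such as None) back when the vectors are orthogonal.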
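from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import NearestNeighbors

# Binary n-gram presence for the reference texts, then TF-IDF weighting
vec = CountVectorizer(strip_accents='ascii', stop_words=stopwords, ngram_range=(1, 3))
bag_of_words = vec.fit_transform(list(map(str, Property)))  # reference
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(bag_of_words.minimum(1))
neigh = NearestNeighbors(n_neighbors=neighbors)
neigh.fit(X_train_tfidf)
# Queries reuse the fitted vocabulary (binary counts, no TF-IDF step here)
X_test_counts = vec.transform(wines_strings).minimum(1)
res = neigh.kneighbors(X_test_counts, return_distance=True)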
Answer 0 (score: 0)
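I decided to simply add a calculation that determines whether the vectors are orthogonal. When they are, I ignore whatever nearest neighbors spits out.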
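import numpy as np
a = X_train_tfidf @ X_test_counts.transpose()  # a[i, j]: dot product of reference i and query j
indicator = a.transpose() @ np.ones(a.shape[0])  # 0 where a query shares no terms with any reference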
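For anyone who wants to try the idea end to end, here is a minimal self-contained sketch of the same orthogonality check. The helper name nearest_or_none and the toy texts are hypothetical, and two things differ from the code above as assumptions: it uses TfidfVectorizer(binary=True) in place of CountVectorizer + minimum(1) + TfidfTransformer, and it asks for the cosine metric explicitly. Because TF-IDF weights are non-negative, a summed dot product of exactly 0 means the query shares no terms with any reference text.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


def nearest_or_none(train_texts, query_texts, n_neighbors=1):
    """Return neighbor indices per query, or None if the query is orthogonal to every reference."""
    # binary=True mirrors the .minimum(1) clipping (assumption: equivalent for this purpose)
    vec = TfidfVectorizer(binary=True)
    X_train = vec.fit_transform(train_texts)
    X_query = vec.transform(query_texts)

    neigh = NearestNeighbors(n_neighbors=n_neighbors, metric="cosine").fit(X_train)
    _, idx = neigh.kneighbors(X_query)

    # Sum of dot products of each query with all reference rows
    overlap = np.asarray((X_query @ X_train.T).sum(axis=1)).ravel()
    return [None if o == 0 else row.tolist() for o, row in zip(overlap, idx)]


# Toy data, purely illustrative
train_texts = ["red wine dry", "white wine sweet"]
query_texts = ["dry red", "zzz qqq"]
result = nearest_or_none(train_texts, query_texts)
```

The second query has no vocabulary in common with the references, so its entry in result is None instead of an arbitrary neighbor; the first query still gets its real nearest neighbor.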