I have been working on a Python script to classify whether an article's headline is related to its body text. For this I am using an ML model (an SVM classifier) with several features, one of which is the average of the word embeddings.
The code that computes the average word-embedding similarity between the headlines and the list of bodies is the following:
import gensim
import numpy as np
from scipy.spatial import distance
from tqdm import tqdm

# `clean`, `lemmatize_str` and the stop-word set `stop` are my own helpers (not shown here).

word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
setm = set(word2vec_model.index2word)

def avg_feature_vector(words, model, num_features, index2word_set):
    # function to average all word vectors in a given paragraph
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0
    for word in words:
        if word in index2word_set and word not in stop:
            try:
                featureVec = np.add(featureVec, model[word])
                nwords = nwords + 1
            except:
                pass
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

def doc_similatiry(headlines, bodies):
    X = []
    docs = []
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        headline_avg_vector = avg_feature_vector(lemmatize_str(clean(headline)).split(), word2vec_model, 300, setm)
        body_avg_vector = avg_feature_vector(lemmatize_str(clean(body)).split(), word2vec_model, 300, setm)
        similarity = 1 - distance.cosine(headline_avg_vector, body_avg_vector)
        X.append(similarity)
    return X, docs
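This similarity score is then used as one of the features for the SVM, together with other features that I am not showing here. Roughly, the training step looks like this (a simplified sketch; `labels` and the train/test handling are placeholders, not my actual pipeline):

    # Simplified sketch: feed the cosine-similarity feature into an SVM classifier.
    # `labels` (related / unrelated per headline/body pair) is assumed to exist.
    import numpy as np
    from sklearn.svm import SVC

    similarities, _ = doc_similatiry(headlines, bodies)
    X_feat = np.array(similarities).reshape(-1, 1)   # one similarity feature per pair
    clf = SVC(kernel='rbf')
    clf.fit(X_feat, labels)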
The averaged word2vec seems to be computed correctly. However, it performs worse than a plain TF-IDF cosine similarity on its own. So my idea was to combine the two features by multiplying each word's word2vec vector by its TF-IDF score.
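Conceptually, the weighting I have in mind is roughly the following (just a simplified sketch of the idea, assuming a TfidfVectorizer that has already been fitted on the documents; `tfidf_vec` and `doc_row` are placeholder names, not my actual code):

    # Sketch of a per-word TF-IDF weighted average of word2vec vectors (illustration only).
    # `tfidf_vec` is an already-fitted TfidfVectorizer and `doc_row` is the row of its
    # output matrix corresponding to the document being averaged.
    def tfidf_weighted_avg(words, model, num_features, tfidf_vec, doc_row):
        vec = np.zeros((num_features,), dtype="float32")
        total_weight = 0.0
        for word in words:
            if word in model and word in tfidf_vec.vocabulary_:
                w = doc_row[0, tfidf_vec.vocabulary_[word]]  # TF-IDF weight of this word in this document
                vec += w * model[word]                       # weight only this word's vector
                total_weight += w
        if total_weight > 0:
            vec /= total_weight
        return vec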
My actual code is the following:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk

def avg_feature_vector(words, model, num_features, index2word_set, tfidf_vec, vec_repr, pos):
    # function to average all word vectors in a given paragraph (with tfidf feature)
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0
    for word in words:
        if word in index2word_set and word not in stop:
            try:
                a = tfidf_vec.vocabulary_[word]
                featureVec = np.add(featureVec, model[word]) * vec_repr[pos, a]
                nwords = nwords + 1
            except:
                pass
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

def doc_similatiry_with_tfidf(headlines, bodies):
    X = []
    docs = []
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        docs.append(lemmatize_str(clean(headline)))
        docs.append(lemmatize_str(clean(body)))
    vectorizer = TfidfVectorizer(norm='l2', min_df=0, use_idf=True, smooth_idf=True, stop_words=stop, sublinear_tf=True)
    sklearn_representation = vectorizer.fit_transform(docs)
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        a = clean(headline)
        headline_avg_vector = avg_feature_vector(nltk.word_tokenize(a), word2vec_model, 300, setm, vectorizer, sklearn_representation, 2*i)
        a = clean(body)
        body_avg_vector = avg_feature_vector(nltk.word_tokenize(a), word2vec_model, 300, setm, vectorizer, sklearn_representation, 2*i+1)
        similarity = 1 - distance.cosine(headline_avg_vector, body_avg_vector)
        X.append(similarity)
    return X, docs
My problem is that this approach gives poor results, and I don't know whether there is a logical explanation for that (since in theory it should perform better) or whether I am doing something wrong in my code.
Can anyone help me figure this out? I am also open to different solutions to this problem.
Note: I use some helper functions here whose code I did not post because I didn't think it was necessary. If anything is unclear, I will explain it in more detail here.