I have been working on a Python script to classify whether an article's headline is related to its body text. For this I am using an ML model (an SVM classifier) with several features, one of which is the average of the word embeddings.
The code that computes the average word-embedding similarity between the headlines and the list of bodies is the following:
import gensim
import numpy as np
from scipy.spatial import distance
from tqdm import tqdm

# `clean`, `lemmatize_str` and the stop-word set `stop` are my own helpers (not shown here).

word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
setm = set(word2vec_model.index2word)

def avg_feature_vector(words, model, num_features, index2word_set):
    # function to average all word vectors in a given paragraph
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0
    for word in words:
        if word in index2word_set and word not in stop:
            try:
                featureVec = np.add(featureVec, model[word])
                nwords = nwords + 1
            except:
                pass
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

def doc_similatiry(headlines, bodies):
    X = []
    docs = []
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        headline_avg_vector = avg_feature_vector(lemmatize_str(clean(headline)).split(), word2vec_model, 300, setm)
        body_avg_vector = avg_feature_vector(lemmatize_str(clean(body)).split(), word2vec_model, 300, setm)
        similarity = 1 - distance.cosine(headline_avg_vector, body_avg_vector)
        X.append(similarity)
    return X, docs
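This similarity score is then used as one of the features for the SVM, together with other features that I am not showing here. Roughly, the training step looks like this (a simplified sketch; `labels` and the train/test handling are placeholders, not my actual pipeline):

    # Simplified sketch: feed the cosine-similarity feature into an SVM classifier.
    # `labels` (related / unrelated per headline/body pair) is assumed to exist.
    import numpy as np
    from sklearn.svm import SVC

    similarities, _ = doc_similatiry(headlines, bodies)
    X_feat = np.array(similarities).reshape(-1, 1)   # one similarity feature per pair
    clf = SVC(kernel='rbf')
    clf.fit(X_feat, labels)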
The averaged word2vec seems to be computed correctly. However, it performs worse than a plain TF-IDF cosine similarity on its own. So my idea was to combine the two features by multiplying each word's word2vec vector by its TF-IDF score.
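Conceptually, the weighting I have in mind is roughly the following (just a simplified sketch of the idea, assuming a TfidfVectorizer that has already been fitted on the documents; `tfidf_vec` and `doc_row` are placeholder names, not my actual code):

    # Sketch of a per-word TF-IDF weighted average of word2vec vectors (illustration only).
    # `tfidf_vec` is an already-fitted TfidfVectorizer and `doc_row` is the row of its
    # output matrix corresponding to the document being averaged.
    def tfidf_weighted_avg(words, model, num_features, tfidf_vec, doc_row):
        vec = np.zeros((num_features,), dtype="float32")
        total_weight = 0.0
        for word in words:
            if word in model and word in tfidf_vec.vocabulary_:
                w = doc_row[0, tfidf_vec.vocabulary_[word]]  # TF-IDF weight of this word in this document
                vec += w * model[word]                       # weight only this word's vector
                total_weight += w
        if total_weight > 0:
            vec /= total_weight
        return vec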
My actual code is the following:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk

def avg_feature_vector(words, model, num_features, index2word_set, tfidf_vec, vec_repr, pos):
    # function to average all word vectors in a given paragraph (with tfidf feature)
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0
    for word in words:
        if word in index2word_set and word not in stop:
            try:
                a = tfidf_vec.vocabulary_[word]
                featureVec = np.add(featureVec, model[word]) * vec_repr[pos, a]
                nwords = nwords + 1
            except:
                pass
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

def doc_similatiry_with_tfidf(headlines, bodies):
    X = []
    docs = []
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        docs.append(lemmatize_str(clean(headline)))
        docs.append(lemmatize_str(clean(body)))
    vectorizer = TfidfVectorizer(norm='l2', min_df=0, use_idf=True, smooth_idf=True, stop_words=stop, sublinear_tf=True)
    sklearn_representation = vectorizer.fit_transform(docs)
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        a = clean(headline)
        headline_avg_vector = avg_feature_vector(nltk.word_tokenize(a), word2vec_model, 300, setm, vectorizer, sklearn_representation, 2*i)
        a = clean(body)
        body_avg_vector = avg_feature_vector(nltk.word_tokenize(a), word2vec_model, 300, setm, vectorizer, sklearn_representation, 2*i+1)
        similarity = 1 - distance.cosine(headline_avg_vector, body_avg_vector)
        X.append(similarity)
    return X, docs
My problem is that this approach gives poor results, and I don't know whether there is a logical explanation for that (since in theory it should perform better) or whether I am doing something wrong in my code.
Can anyone help me figure this out? I am also open to different solutions to this problem.
Note: I use some helper functions here whose code I did not post because I didn't think it was necessary. If anything is unclear, I will explain it in more detail here.