Computing cosine similarity between TfidfVectorizer vectors of different lengths for the same documents

Asked: 2019-04-12 06:21:24

Tags: python-3.x nlp tf-idf tfidfvectorizer

Expected result and a summary of what I am trying to do:

1. From one list, I created a second list based on frequency, so there are two lists: `Original list` and `Frequent items list` (the frequent items list is a subset of the original list).
2. I calculate a TF-IDF vector for each element in both lists.
3. For each item in `Frequent items list`, I need the cosine similarity with each item in `Original list`.
4. If the cosine similarity is greater than some threshold, I add that element to a set; otherwise I skip it.
5. In the end, I want a dictionary whose keys are the elements of `Frequent items list` and whose values are the sets of elements with cosine similarity above the threshold.

Using TfidfVectorizer, I computed vectors for both lists, but because the vectors have different sizes I cannot compute the cosine similarity.
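To illustrate where the size difference comes from, here is a minimal sketch with made-up bigrams (hypothetical toy data, not the lists from this question). Each `TfidfVectorizer` builds its vocabulary from whatever it was fitted on, so two separately fitted vectorizers produce matrices with different column counts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the real lists.
original_list = ["time picked", "picked drop", "bus good", "time picked", "good service"]
frequent_list = ["time picked"]

# Two independent vectorizers, each with its own vocabulary.
vec_all = TfidfVectorizer(ngram_range=(2, 2)).fit(original_list)
vec_freq = TfidfVectorizer(ngram_range=(2, 2)).fit(frequent_list)

# The column count equals the vocabulary size of the fitted corpus.
shape_all = vec_all.transform(original_list).shape
shape_freq = vec_freq.transform(frequent_list).shape
print(shape_all, shape_freq)
```

Rows from the two matrices live in different feature spaces, so comparing them column by column would not be meaningful even if the lengths happened to match.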

Here is the code that computes the frequency distribution of the original list:

import nltk
bigram_freq_dist = nltk.FreqDist(original_list)

Result:

FreqDist({'time picked': 8, 'picked drop': 7, 'bus good': 5, 'good bus': 5, 'best service': 4, 'rest stop': 4, 'comfortable journey': 4, 'good service': 4, 'everything good': 3, 'staff behaviour': 3, ...})

From this frequency distribution, I kept the items with a frequency of 2 or more:

bi_vector_list = []

for name, freq in bigram_freq_dist.items():
    if freq >= 2:
        bi_vector_list.append(name)
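
The same filter can be sketched with only the standard library. This assumes a small hypothetical sample in place of the real `original_list`; `nltk.FreqDist` is a subclass of `collections.Counter`, so the comprehension should work the same way on either:

```python
from collections import Counter

# Hypothetical sample; the real original_list holds 1170 bigram strings.
original_list = [
    "time picked", "picked drop", "time picked",
    "bus good", "picked drop", "time picked",
]

bigram_counts = Counter(original_list)

# Keep bigrams that occur at least twice, mirroring `freq >= 2` above.
bi_vector_list = [name for name, freq in bigram_counts.items() if freq >= 2]
print(bi_vector_list)
```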

TF-IDF vectors computed for the frequent bigrams:

#for top frequency element
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(2,2))
vectors = vectorizer.fit_transform(bi_vector_list)
bivectors = vectors.toarray()

Shape:

bivectors.shape = (23,23)

For all elements:

#for all element
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_all = TfidfVectorizer(ngram_range=(2,2))
vectors_all = vectorizer_all.fit_transform(original_list)
bivectors_all = vectors_all.toarray()

Shape for the original list:

bivectors_all.shape  = (1170, 1071)

Computing the cosine similarity:

#cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

bi_sim_dict = dict()
for i in range(len(vectorizer.get_feature_names())):
    local_list = set()
    for vector in bivectors[0]:
            for j in range(len(vectorizer_all.get_feature_names())):
                for element_vector in bivector_elements[0]:
                    if cosine_similar(vector,element_vector) > 0.7:
                        local_list.add(vectorizer.get_feature_names()[j])

    bi_sim_dict[vectorizer.get_feature_names()[i]] = local_list

The cosine similarity helper:

def cosine_similar(vector1,vector2):
    similarity = cosine_similarity([vector1,vector2])
    return similarity

I get a shape error, and I understand why it occurs.
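
One possible way around the mismatch, offered as a sketch rather than a confirmed solution, is to fit a single vectorizer on the original list and reuse it to transform the frequent items, so both sets of vectors share one vocabulary. The function name `similar_items` and the toy data are hypothetical. Note also that with `ngram_range=(2,2)` every two-word string yields exactly one feature, so the similarity can only be 0 or 1; the sketch assumes `ngram_range=(1, 2)` purely so that partial overlap becomes visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similar_items(frequent_items, all_items, threshold=0.7, ngram_range=(1, 2)):
    """Map each frequent item to the set of items whose similarity exceeds threshold."""
    # Deduplicate so each candidate is compared once (an assumption of this sketch).
    unique_items = sorted(set(all_items))

    # One vectorizer, fitted once: both matrices get the same columns.
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    all_vectors = vectorizer.fit_transform(unique_items)
    freq_vectors = vectorizer.transform(frequent_items)

    # sims[i, j] is the similarity of frequent item i and unique item j.
    sims = cosine_similarity(freq_vectors, all_vectors)

    return {
        item: {unique_items[j] for j in range(len(unique_items)) if sims[i, j] > threshold}
        for i, item in enumerate(frequent_items)
    }


# Hypothetical toy data, not the lists from the question.
all_items = ["good bus", "bus good", "best service", "good service", "good bus"]
strict = similar_items(["good bus"], all_items)
loose = similar_items(["good bus"], all_items, threshold=0.4)
print(strict)
print(loose)
```

Because `cosine_similarity` accepts two matrices and returns the full pairwise table in one call, the nested loops over individual vectors are not needed in this variant.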

0 answers:

No answers yet