Summary of the expected result and what I am trying to do:
1. From one list, I created a second list based on frequency, so there are two lists: `Original list` and `Frequent items list` (the frequent items list is a subset of the original list).
2. Calculate a TFIDF vector for each element in both lists.
3. For each item in `Frequent items list`, compute the cosine similarity with each item in `Original list`.
4. If the cosine similarity is greater than some threshold, add that element to a set; otherwise skip it.
5. The result should be a dictionary whose keys are the elements of `Frequent items list` and whose values are the sets of elements with cosine similarity greater than the threshold.
Using the TFIDF vectorizer, I computed vectors for both lists, but because the vectors have different sizes, I cannot compute the cosine similarity.
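To make the size mismatch concrete, here is a minimal sketch with made-up data (the two small lists are hypothetical stand-ins, not the real data). It shows that two separately fitted vectorizers learn different vocabularies, while fitting one vectorizer on the full list and reusing it via `transform` keeps the columns aligned:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the two lists described above.
original_list = ["time picked", "picked drop", "bus good", "good bus", "time picked"]
frequent_list = ["time picked"]

# Two separate fits: each vectorizer learns its own vocabulary,
# so the two matrices have a different number of columns.
separate_freq = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(frequent_list)
separate_all = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(original_list)
print(separate_freq.shape, separate_all.shape)

# One fit on the full list, then transform the subset with the same
# vectorizer: both matrices now share the same columns.
shared = TfidfVectorizer(ngram_range=(2, 2))
all_vectors = shared.fit_transform(original_list)
freq_vectors = shared.transform(frequent_list)
print(freq_vectors.shape, all_vectors.shape)
```

Since the frequent items are a subset of the original list, every feature they contain is already in the vocabulary learned from the full list.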
Below is the code that computes the frequency distribution of the original list:
import nltk
bigram_freq_dist = nltk.FreqDist(original_list)
Result:
FreqDist({'time picked': 8, 'picked drop': 7, 'bus good': 5, 'good bus': 5, 'best service': 4, 'rest stop': 4, 'comfortable journey': 4, 'good service': 4, 'everything good': 3, 'staff behaviour': 3, ...})
From this frequency distribution, I kept the items that occur at least twice:
bi_vector_list = []
for name, freq in bigram_freq_dist.items():
    if freq >= 2:
        bi_vector_list.append(name)
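For reference, the same filter can be written as a list comprehension. The sketch below uses `collections.Counter` with a made-up list so it runs without NLTK; `nltk.FreqDist` is a `Counter` subclass, so the same `.items()` loop should apply to it as well:

```python
from collections import Counter

# Hypothetical sample data standing in for original_list.
original_list = ["time picked", "time picked", "picked drop", "bus good", "bus good"]

# Count occurrences, then keep only items seen at least twice.
bigram_counts = Counter(original_list)
bi_vector_list = [name for name, freq in bigram_counts.items() if freq >= 2]
print(bi_vector_list)
```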
TFIDF vectors computed for the frequent bigrams:
#for top frequency element
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(2,2))
vectors = vectorizer.fit_transform(bi_vector_list)
bivectors = vectors.toarray()
Shape:
bivectors.shape = (23,23)
For all elements:
#for all element
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_all = TfidfVectorizer(ngram_range=(2,2))
vectors_all = vectorizer_all.fit_transform(original_list)
bivectors_all = vectors_all.toarray()
Shape for the original list:
bivectors_all.shape = (1170, 1071)
Computing the cosine similarity:
#cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
bi_sim_dict = dict()
for i in range(len(vectorizer.get_feature_names())):
    local_list = set()
    for vector in bivectors[0]:
        for j in range(len(vectorizer_all.get_feature_names())):
            for element_vector in bivector_elements[0]:
                if cosine_similar(vector,element_vector) > 0.7:
                    local_list.add(vectorizer.get_feature_names()[j])
    bi_sim_dict[vectorizer.get_feature_names()[i]] = local_list
The cosine similarity helper:
def cosine_similar(vector1,vector2):
    similarity = cosine_similarity([vector1,vector2])
    return similarity
This raises a shape error, and I understand why the error occurs.
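One possible way to build the dictionary described in step 5 is sketched below. This is only a sketch under stated assumptions, not a definitive fix: the helper name `similar_items` and the sample data are made up, the original list is de-duplicated before vectorizing, and the default `ngram_range=(1, 1)` is an assumption. With `ngram_range=(2, 2)` and two-word items, each item yields a single bigram feature, so the similarity would only ever be 0 or 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similar_items(frequent_items, all_items, threshold=0.7, ngram_range=(1, 1)):
    """Map each frequent item to the set of items whose cosine
    similarity with it is greater than `threshold` (hypothetical helper)."""
    unique_items = sorted(set(all_items))  # assumption: duplicates are not needed
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    # Fit once on the full list, then reuse the same vocabulary for the subset.
    all_vectors = vectorizer.fit_transform(unique_items)
    freq_vectors = vectorizer.transform(frequent_items)
    # Pairwise similarities, shape (len(frequent_items), len(unique_items)).
    sims = cosine_similarity(freq_vectors, all_vectors)
    result = {}
    for i, item in enumerate(frequent_items):
        result[item] = {
            unique_items[j]
            for j in range(len(unique_items))
            if sims[i, j] > threshold
        }
    return result

# Hypothetical sample data.
original_list = ["good bus", "bus good", "time picked", "picked drop", "good bus"]
frequent_list = ["good bus"]
bi_sim_dict = similar_items(frequent_list, original_list, threshold=0.7)
print(bi_sim_dict)
```

Passing both matrices to `cosine_similarity` in one call avoids the nested loops and works directly on the sparse output of the vectorizer, so `.toarray()` is not needed.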