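Hello, I want to group movies based on their titles. My function works well on my data, but I have a big problem: my sample is 150,000 movies and it is very slow. It actually takes 3 days to cluster all the movies.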
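Process:
Sort the movies by length.
Transform the movies with CountVectorizer and calculate the similarity for each one (for each clustered movie I fit a vectorizer, and then I transform the target movie with it).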
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Cache of fitted vectorizers, keyed by cleaned movie title.
fitted_vectorizer = {}

def product_similarity(clustered_movie, target_movie):
    '''
    Calculates the cosine similarity of 2 movies based on title
    '''
    # Replace digits, underscores and punctuation (except apostrophes) with
    # spaces, so the same cleaned title is used for the lookup, fit and transform.
    cleaned = re.sub(r"[0-9]|[^\w']|[_]", " ", clustered_movie)
    # fitted_vectorizer is a dictionary of already fitted movies; if we have
    # not fitted a vectorizer to this movie yet, fit it and save it.
    if cleaned not in fitted_vectorizer:
        fitted_vectorizer[cleaned] = CountVectorizer(stop_words=None).fit([cleaned])
    vectorizer = fitted_vectorizer[cleaned]
    a = vectorizer.transform([cleaned]).toarray()
    b = vectorizer.transform([target_movie]).toarray()
    similarity = cosine_similarity(a, b)
    return similarity[0][0]
Answer 0 (score: 0)
Fit the CountVectorizer once on all titles. Save the model. Then use the fitted model to transform.
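A minimal sketch of that suggestion, assuming scikit-learn is available. The helper names `clean_title`, `fit_title_vectorizer` and `title_similarities`, as well as the sample titles, are made up for illustration and are not part of the question's code. Note that with one shared vocabulary the scores will not be identical to the per-title version in the question, because words that appear only in the target title now count toward its vector.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def clean_title(title):
    # Same cleaning as in the question: digits, underscores and punctuation
    # (except apostrophes) become spaces.
    return re.sub(r"[0-9]|[^\w']|[_]", " ", title)


def fit_title_vectorizer(titles):
    # Fit ONE vectorizer on every title and keep the sparse count matrix.
    vectorizer = CountVectorizer(stop_words=None)
    matrix = vectorizer.fit_transform([clean_title(t) for t in titles])
    return vectorizer, matrix


def title_similarities(vectorizer, matrix, target_title):
    # Transform the target once and compare it against all titles in a
    # single call; the matrices stay sparse (no .toarray()).
    target = vectorizer.transform([clean_title(target_title)])
    return cosine_similarity(target, matrix)[0]


# Hypothetical sample data, for illustration only.
titles = ["The Matrix", "The Matrix Reloaded", "Toy Story"]
vectorizer, matrix = fit_title_vectorizer(titles)
scores = title_similarities(vectorizer, matrix, "Matrix Reloaded")
```

Here `scores[i]` is the similarity between the target and `titles[i]`, so one call replaces a Python loop over every clustered movie. If the fitted vectorizer needs to be reused across runs, it can be persisted with `pickle` or `joblib`.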