Making CountVectorizer faster for large datasets

Time: 2017-10-31 08:52:18

Tags: python-3.x performance scikit-learn countvectorizer

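Hello, I want to cluster movies based on their titles. My function works well on my data, but I have one big problem: my sample contains 150,000 movies and the function is very slow. It actually takes 3 days to cluster all the movies.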

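Process:

Sort the movies by title length

Transform the movies with CountVectorizer and compute the similarity for each one (a vectorizer is fitted once per clustered movie, and each target movie is then transformed with it)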

import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Cache of fitted vectorizers, keyed by the cleaned title
fitted_vectorizer = {}

def product_similarity(clustered_movie, target_movie):
    '''
    Calculates the title similarity of 2 movies based on their titles
    '''
    # Replace digits, punctuation and underscores with spaces, so the same
    # cleaned title is used for both the cache lookup and the fit
    clustered_movie = re.sub(r"[0-9]|[^\w']|[_]", " ", clustered_movie)

    # fitted_vectorizer is a dictionary of vectorizers already fitted to a movie;
    # if this movie has no vectorizer yet, fit one and save it to the dictionary
    if clustered_movie not in fitted_vectorizer:
        vectorizer = CountVectorizer(stop_words=None)
        fitted_vectorizer[clustered_movie] = vectorizer.fit([clustered_movie])

    vectorizer = fitted_vectorizer[clustered_movie]
    a = vectorizer.transform([clustered_movie]).toarray()
    b = vectorizer.transform([target_movie]).toarray()
    similarity = cosine_similarity(a, b)

    return similarity[0][0]

1 Answer:

Answer 0 (score: 0)

Fit CountVectorizer once on all the titles. Save the model. Then use the fitted model to transform.
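A minimal sketch of that suggestion, assuming all titles fit in memory as a list. The sample titles and the helper names `clean_title` and `title_similarities` are hypothetical and only serve the illustration. The vectorizer is fitted once, the count matrix stays sparse, and one `cosine_similarity` call scores a target title against many candidates at once instead of one pair per call.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def clean_title(title):
    # Same cleanup as in the question: digits, punctuation and
    # underscores are replaced with spaces.
    return re.sub(r"[0-9]|[^\w']|[_]", " ", title)


# Hypothetical sample data; in practice this would be the full title list.
titles = ["The Matrix", "The Matrix Reloaded", "Toy Story 2", "Toy Story"]

# Fit a single vectorizer on every title and keep the sparse count matrix.
vectorizer = CountVectorizer(stop_words=None)
title_matrix = vectorizer.fit_transform([clean_title(t) for t in titles])


def title_similarities(target_movie, candidate_rows):
    # Transform the target with the already fitted vectorizer (no refit),
    # then compare it against all candidate rows in one call.
    target_vec = vectorizer.transform([clean_title(target_movie)])
    return cosine_similarity(title_matrix[candidate_rows], target_vec).ravel()


scores = title_similarities("Matrix Reloaded", [0, 1, 2, 3])
print(scores)
```

Because the vocabulary is now shared across all titles rather than built from a single clustered title, the scores will not match the per-title version in the question exactly, so any similarity threshold may need re-tuning. The fitted vectorizer is an ordinary Python object, so it can be saved with `pickle` or `joblib` and reloaded later.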