I get an "array is too big" error when computing the dot product of a matrix with itself.
A sample of the data:
metadata['overview'].head()
out: 0 Led by Woody, Andy's toys live happily in his ...
1 When siblings Judy and Peter discover an encha...
2 A family wedding reignites the ancient feud be...
3 Cheated on, mistreated and stepped on, the wom...
4 Just when George Banks has recovered from his ...
Name: overview, dtype: object
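Applying a TF-IDF vectorizer gives a matrix in which each column represents a word in the overview vocabulary and each row represents a movie.
from sklearn.feature_extraction.text import TfidfVectorizer
#Define a TF-IDF Vectorizer Object. Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')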
#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')
#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])
#Output the shape of tfidf_matrix
tfidf_matrix.shape
out[3]: (45466, 75827)
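To illustrate the layout of that matrix (one row per document, one column per vocabulary term), here is a minimal sketch on three made-up overviews. The `docs` list is hypothetical sample text, not taken from the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up overviews (hypothetical data, for illustration only)
docs = [
    "toys live happily in a room",
    "siblings discover an enchanted board game",
    "a family wedding reignites an ancient feud",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(docs)  # sparse matrix of TF-IDF weights

# Rows correspond to documents, columns to vocabulary terms
n_docs, n_terms = matrix.shape

# vocabulary_ maps each term to its column index
vocab = tfidf.vocabulary_
print(n_docs, n_terms, sorted(vocab))
```

On the real data the same reading applies: 45466 movies (rows) and 75827 distinct terms (columns).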
I am using scikit-learn's linear_kernel function:
from sklearn.metrics.pairwise import linear_kernel
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
Running this line raises the following error:
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
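For context, a rough estimate of the size of the requested result. This is a sketch that assumes `linear_kernel` returns a dense 45466 x 45466 array of float64 values (8 bytes each). The comparison with a signed 32-bit limit is also an assumption: it is only relevant if the Python build in use is 32-bit, which the post does not state.

```python
n_movies = 45466   # number of rows in tfidf_matrix, from the shape above
itemsize = 8       # bytes per element, assuming float64

n_elements = n_movies * n_movies   # one similarity score per pair of movies
n_bytes = n_elements * itemsize    # arr.size * arr.dtype.itemsize

# Largest value a signed 32-bit size can hold (assumption: 32-bit Python)
limit_32bit = 2**31 - 1

print(f"{n_elements:,} elements")
print(f"{n_bytes / 1e9:.1f} GB ({n_bytes / 2**30:.1f} GiB)")
print("exceeds 32-bit limit:", n_bytes > limit_32bit)
```

Under these assumptions the full similarity matrix needs about 16.5 GB, far more than the sparse TF-IDF matrix it is computed from.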
Can anyone advise how to resolve this error?