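I built a tf-idf model on ~20,000,000 documents using the code below, and it works well. The problem is that memory usage explodes when I try to compute similarity scores with linear_kernel: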
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
train_file = "docs.txt"
train_docs = DocReader(train_file) #DocReader is a generator for individual documents
vectorizer = TfidfVectorizer(stop_words='english',max_df=0.2,min_df=5)
X = vectorizer.fit_transform(train_docs)
#predicting a new vector, this works well when I check the predictions
indoc = "This is an example of a new doc to be predicted"
invec = vectorizer.transform([indoc])
#This is where the memory blows up
similarities = linear_kernel(invec, X).flatten()
It seems like this shouldn't take much memory: comparing a 1-row CSR matrix against a 20-million-row CSR matrix should output a 1x20mil ndarray.
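Just FYI: X is a CSR matrix that takes ~12 GB in memory (my computer only has 16 GB). I have tried looking into gensim as a replacement, but I couldn't find a good example.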
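For scale, here is a rough back-of-the-envelope estimate of the size of that 1x20mil result. It assumes the output is a dense float64 array, which is an assumption and not something measured on the real data; the variable names are illustrative only.

```python
import numpy as np

# Approximate document count taken from the question.
n_docs = 20_000_000

# Assumption: each similarity score is stored as a float64 (8 bytes).
bytes_per_value = np.dtype(np.float64).itemsize

# Size of a dense 1 x n_docs result array.
output_bytes = n_docs * bytes_per_value
output_mb = output_bytes / 1e6

print(f"{output_mb:.0f} MB")  # prints "160 MB"
```

Under that assumption the result itself is small next to the ~12 GB matrix, so the extra memory is more likely to come from intermediate copies made during the computation than from the output array.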
Any thoughts on what I am missing?
Answer 0 (score: 0)
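You can do the processing in batches. Here is an example based on your code snippet, but with the dataset replaced by one that ships with sklearn. For this smaller dataset, I also compute the result the original way to show that the two are equivalent. You can probably use a larger batch size.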
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.datasets import fetch_20newsgroups
train_docs = fetch_20newsgroups(subset='train')
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.2,min_df=5)
X = vectorizer.fit_transform(train_docs.data)
#predicting a new vector, this works well when I check the predictions
indoc = "This is an example of a new doc to be predicted"
invec = vectorizer.transform([indoc])
# Compute the similarities in batches to keep peak memory low
batchsize = 1024
similarities = []
for i in range(0, X.shape[0], batchsize):
    similarities.extend(linear_kernel(invec, X[i:min(i+batchsize, X.shape[0])]).flatten())
similarities = np.array(similarities)
# For comparison, compute everything at once (feasible for this small dataset)
similarities_orig = linear_kernel(invec, X)
print((similarities == similarities_orig).all())
Output:
True
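The same batching idea can be wrapped in a small helper that writes each batch into a preallocated array instead of growing a Python list. This is only a sketch: `batched_linear_kernel` is a hypothetical name, and the data is a synthetic sparse matrix from `scipy.sparse.random` standing in for the real tf-idf matrix, so the sizes are arbitrary.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import linear_kernel


def batched_linear_kernel(query, X, batchsize=1024):
    """Dot products between one query row and every row of X, computed in batches."""
    n_rows = X.shape[0]
    # Preallocate the result so no intermediate Python list is built.
    out = np.empty(n_rows, dtype=np.float64)
    for start in range(0, n_rows, batchsize):
        stop = min(start + batchsize, n_rows)
        # Only this slice of X takes part in the product at any one time.
        out[start:stop] = linear_kernel(query, X[start:stop]).ravel()
    return out


# Synthetic stand-in data (hypothetical sizes), not the matrix from the question.
X_demo = sparse.random(5000, 300, density=0.01, format="csr", random_state=0)
query_demo = sparse.random(1, 300, density=0.05, format="csr", random_state=1)

sims = batched_linear_kernel(query_demo, X_demo, batchsize=512)

# Reference result computed in a single call, for comparison on this small input.
full = linear_kernel(query_demo, X_demo).ravel()
matches_full = np.allclose(sims, full)
```

Preallocating keeps the result as a single float64 array, which matters more as the number of rows grows; the batch size trades per-call overhead against peak memory and would need tuning on the real data.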