python scikit-learn cosine similarity ValueError: could not convert integer scalar

Asked: 2017-03-28 21:28:41

Tags: python scikit-learn cosine-similarity sklearn-pandas

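I'm trying to generate a cosine similarity matrix from the text descriptions of apps. The script below first reads in a csv data file (I can supply the data file if needed) with two columns: one holds two app categories, and the other holds the tokenized, stemmed descriptions of many apps in each of those two categories. The script then builds a tfidf matrix and tries to generate a cosine similarity matrix from it.

Yesterday I updated Anaconda 64-bit for Windows to make sure I have the latest versions of Python, numpy, scipy and scikit-learn.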

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os

print ('reading file into pandas')
data = pd.read_csv(os.path.join('inputfile.csv'))
cats = np.unique(data['category'])

for i in cats:
    print ()
    print ('prepping', i)
    d2 = data[data.category == i]
    descStem = d2.descStem.tolist()

    print ('vectorizing', i)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), min_df=2, stop_words='english')
    tfidf_matrix = tfidf_vectorizer.fit_transform(descStem)
    print (tfidf_matrix.shape)

    print ('calculating cosine sim', i)
    cosOrig = cosine_similarity(tfidf_matrix, tfidf_matrix)

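The script works fine on the smaller category, Comics, where tfidf_matrix.shape = (3119, 8217). However, I get the error message below for the larger category, Education, where tfidf_matrix.shape = (90327, 62863). That matrix has more than 2^32 cells.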

Traceback (most recent call last):

File "<ipython-input-1-4b2586ddeca4>", line 1, in <module>

runfile('Z:/rangus/gplay/marcello/data/similarity/error/cosSimByCatScrapeError.py', wdir='Z:/rangus/gplay/marcello/data/similarity/error')

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "Z:/rangus/gplay/marcello/data/similarity/error/cosSimByCatScrapeError.py", line 23, in <module>
cosOrig = cosine_similarity(tfidf_matrix, tfidf_matrix)

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py", line 918, in cosine_similarity
K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\sklearn\utils\extmath.py", line 186, in safe_sparse_dot
ret = ret.toarray()

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 920, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\scipy\sparse\coo.py", line 258, in toarray
B.ravel('A'), fortran)

ValueError: could not convert integer scalar
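For scale, the sizes involved can be checked with plain arithmetic. The shape is the one reported above for Education; the variable names are only for illustration. The traceback shows the result of X · Y.T being converted with toarray(), so the dense output would be n_rows x n_rows. Whether crossing the 32-bit limit is the actual trigger of the ValueError is my assumption, not something I have confirmed.

```python
# Shape reported above for the Education category.
n_rows, n_cols = 90327, 62863

# Number of cells in the tf-idf matrix (if it were stored densely).
tfidf_cells = n_rows * n_cols

# cosine_similarity(X, X) returns an n_rows x n_rows matrix.
sim_cells = n_rows * n_rows

# Memory needed to hold the dense similarity matrix as float64.
sim_gib = sim_cells * 8 / 2**30

print('tf-idf cells:', tfidf_cells, 'exceeds 2^32:', tfidf_cells > 2**32)
print('similarity cells:', sim_cells, 'exceeds 2^32:', sim_cells > 2**32)
print('dense similarity matrix, GiB:', round(sim_gib, 1))
```

Both counts are above 2^32, and the dense similarity matrix alone would need roughly 61 GiB as float64.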

I can get past this error by running the code below, but using dense matrices is a huge memory hog, and I need to run this script on 40+ categories.

print ('vectorizing', i)
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), min_df=2, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(descStem)
tfidf_matrixD = tfidf_matrix.toarray()

print ('calculating cosine sim', i)
cosOrig = cosine_similarity(tfidf_matrixD, tfidf_matrixD)
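One way to avoid holding the whole result at once might be to compute the similarities in row blocks, so that only one slab of the n x n matrix is dense at any time. This is only a sketch under assumptions, not a tested fix for the Education data: cosine_similarity_blocks, block_rows and the toy matrix are made-up names for illustration and are not part of scikit-learn. It relies on the fact that, after L2-normalising the rows, cosine similarity is a plain dot product.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize


def cosine_similarity_blocks(matrix, block_rows=1000):
    """Yield (start_row, dense_block) slabs of the cosine similarity matrix.

    Each dense_block has shape (<= block_rows, n_rows), so only one slab
    of the full n_rows x n_rows result is dense at a time.
    (Hypothetical helper, not a scikit-learn function.)
    """
    # L2-normalise every row once; all-zero rows stay all-zero.
    normed = normalize(sparse.csr_matrix(matrix), norm='l2', axis=1)
    normed_t = normed.T.tocsr()
    for start in range(0, normed.shape[0], block_rows):
        # Sparse x sparse product for this slab, densified only here.
        block = normed[start:start + block_rows] @ normed_t
        yield start, block.toarray()


# Small random stand-in for tfidf_matrix (about 20% non-zero entries).
rng = np.random.RandomState(0)
values = rng.rand(50, 20)
mask = rng.rand(50, 20) < 0.2
toy = sparse.csr_matrix(values * mask)

blocks = list(cosine_similarity_blocks(toy, block_rows=16))

# On the toy data the stacked slabs can be compared with the one-shot result.
stacked = np.vstack([block for _, block in blocks])
print(stacked.shape, np.allclose(stacked, cosine_similarity(toy)))
```

With 90327 rows, a 1000-row slab of float64 values is about 0.7 GB, so each slab would have to be reduced (for example thresholded or written to disk) before moving on to the next. cosine_similarity also takes a dense_output=False argument that returns a sparse result when both inputs are sparse, but that result could still be very large if most pairs of descriptions share at least one term.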

This is the closest similar issue I could find on StackOverflow, but I can't see how it helps in my case...

0 answers:

No answers