I have a numpy array of shape 424970 by 512. All the values in each vector are floats, and the data is not sparse. I have access to a cluster with 3000 GB (3 TB) of memory and 92 cores.

When I try to compute the pairwise similarities with the following code,
```
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vectors = np.load("vectors.npy")
similarity_matrix = cosine_similarity(vectors)
```
it runs out of memory.
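A back-of-the-envelope check (my own arithmetic, not anything from an error message) of why this is so memory-hungry: the dense result matrix alone is on the order of 1.4 TB in float64:

```
n = 424_970                             # number of vectors
result_bytes = n * n * 8                # dense float64 output of cosine_similarity
print(f"{result_bytes / 1e12:.2f} TB")  # ~1.44 TB for the result alone
```

So the result by itself is already within a factor of two of the node's 3 TB, before counting whatever intermediate copies sklearn makes along the way.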
Then I tried the following code (from this answer) to split the computation into chunks, which might (I am not sure) reduce the memory footprint:
```
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels

vectors = np.load("vectors.npy")
similarity_matrix = pairwise_kernels(vectors, metric="cosine", n_jobs=80)
```
But now, after running for almost an hour, I got this strange error:
```
File "/shared/centos7/python/3.7.0/lib/python3.7/site-packages/sklearn/metrics/pairwise.py", line 1405, in pairwise_kernels
return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
File "/shared/centos7/python/3.7.0/lib/python3.7/site-packages/sklearn/metrics/pairwise.py", line 1096, in _parallel_pairwise
for s in gen_even_slices(Y.shape[0], n_jobs))
File "/shared/centos7/python/3.7.0/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 789, in __call__
self.retrieve()
File "/shared/centos7/python/3.7.0/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 699, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/shared/centos7/python/3.7.0/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[array([[ 1. , 0.0775377 , 0.0775377 , ..., -0.03980096,
0.07633844, 0.09164172],
[ 0.0775377 , 1. , 1. , ..., -0.00941773,
0.0775912 , 0.00236293],
[ 0.0775377 , 1. , 1. , ..., -0.00941773,
0.0775912 , 0.00236293],
...,
[-0.04346758, -0.01262202, -0.01262202, ..., -0.00163259,
0.00362771, 0.05795608],
[ 0.02601108, 0.06613082, 0.06613082, ..., 0.03252933,
-0.02377374, 0.0862666 ],
[-0.0505936 , 0.05680718, 0.05680718, ..., -0.02619641,
0.00659445, 0.13375922]])]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647")'
```
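If I am reading the traceback right, `pairwise_kernels` with `n_jobs=80` splits `Y` into 80 even slices (the `gen_even_slices` call in the traceback), and each worker sends its full result block back through `multiprocessing`, whose pickling in Python 3.7 writes the payload length as a 32-bit `'i'`. Assuming that reading is correct, my blocks are far over that limit:

```
import numpy as np

n, n_jobs = 424_970, 80
rows = int(np.ceil(n / n_jobs))       # largest slice handed to one worker
block_bytes = n * rows * 8            # float64 block of shape (n, rows)
print(f"{block_bytes / 2**30:.1f} GiB")  # ~16.8 GiB per block
print(block_bytes > 2**31 - 1)        # True: too big for a 32-bit length field
```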
Just to give full details, here is my job script:
```
#!/bin/bash
#SBATCH --job-name=cosine
#SBATCH --partition=XXXX
#SBATCH --nodes=1
#SBATCH --cpus-per-task=92
#SBATCH --mem=3000GB

module load python/3.7.0
python3 05_05_cosine_similarity_n_jobs.py
```
I have a few questions:
1) Does this error mean I am still running out of memory?
2) If the answer to 1) is yes, is there a simple tweak I can make to get the script to run?
3) If there is no simple tweak, what would be a broad approach to designing a solution?
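For what it is worth, the broad approach I was imagining for 3) is to compute the matrix in row blocks and write each block straight into a memory-mapped file, so that no single allocation (and no inter-process transfer) ever holds the full result. A minimal sketch, assuming float32 precision is acceptable and a chunk size of 10 000 rows (both my own choices, to be tuned):

```
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vectors = np.load("vectors.npy")
n = vectors.shape[0]
chunk = 10_000                        # rows per block; tune to available RAM

# Memory-mapped output: the full (n, n) matrix lives on disk, not in RAM.
out = np.lib.format.open_memmap(
    "similarity_matrix.npy", mode="w+", dtype=np.float32, shape=(n, n)
)

for start in range(0, n, chunk):
    stop = min(start + chunk, n)
    # Each call produces one (chunk, n) block in a single process, so the
    # 2 GiB multiprocessing transfer limit never comes into play.
    out[start:stop] = cosine_similarity(vectors[start:stop], vectors)

out.flush()
```

Would something along these lines be a reasonable direction, or is there a better-established pattern for this?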