Multiprocessing error when trying to compute pairwise similarity for a very large number of vectors

Asked: 2018-12-22 15:21:00

Tags: python multiprocessing vectorization cosine-similarity

I have a numpy array of shape 424970 by 512. All values in every vector are floats, and the vectors are not sparse.

I have access to a cluster with 3000 GB (3 TB) of memory and 92 cores.

When I try to compute the pairwise similarity with the following lines of code,

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
vectors = np.load("vectors.npy")
similarity_matrix = cosine_similarity(vectors)

it runs out of memory.
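For scale, here is a rough estimate of the size of the result alone. It assumes the output is a dense float64 matrix with one entry per pair of vectors; any temporary copies made during the computation would come on top of this. That works out to about 1.44 TB.

```python
# Back-of-the-envelope size of the full similarity matrix.
# Assumption: dense float64 output, one entry per pair of vectors.
n_vectors = 424970
bytes_per_float64 = 8

result_bytes = n_vectors * n_vectors * bytes_per_float64
result_terabytes = result_bytes / 1e12

print(f"{result_bytes} bytes, about {result_terabytes:.2f} TB")
```

So the output by itself takes up close to half of the 3 TB available, before counting the input or any intermediate arrays.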

Then I tried the following code (from this answer), which splits the array into chunks and so might (I am not sure) reduce the memory footprint:

import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels
vectors = np.load("vectors.npy")
similarity_matrix = pairwise_kernels(vectors, metric="cosine", n_jobs=80)

But now, after running for nearly an hour, I get this strange error:

```

 File "/shared/centos7/python/3.7.0/lib/python3.7/site-packages/sklearn/metrics/pairwise.py", line 1405, in pairwise_kernels
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
  File "/shared/centos7/python/3.7.0/lib/python3.7/site-packages/sklearn/metrics/pairwise.py", line 1096, in _parallel_pairwise
    for s in gen_even_slices(Y.shape[0], n_jobs))
  File "/shared/centos7/python/3.7.0/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 789, in __call__
    self.retrieve()
  File "/shared/centos7/python/3.7.0/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 699, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/shared/centos7/python/3.7.0/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[array([[ 1.        ,  0.0775377 ,  0.0775377 , ..., -0.03980096,
         0.07633844,  0.09164172],
       [ 0.0775377 ,  1.        ,  1.        , ..., -0.00941773,
         0.0775912 ,  0.00236293],
       [ 0.0775377 ,  1.        ,  1.        , ..., -0.00941773,
         0.0775912 ,  0.00236293],
       ...,
       [-0.04346758, -0.01262202, -0.01262202, ..., -0.00163259,
         0.00362771,  0.05795608],
       [ 0.02601108,  0.06613082,  0.06613082, ...,  0.03252933,
        -0.02377374,  0.0862666 ],
       [-0.0505936 ,  0.05680718,  0.05680718, ..., -0.02619641,
         0.00659445,  0.13375922]])]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647")'

```
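The range in the error message is the range of a signed 32-bit integer. One reading that fits the traceback is that each worker tries to send its whole result block back to the parent process as one message, and the byte length of that message does not fit in a 32-bit length header. The sketch below checks the numbers. It assumes the columns are split evenly across `n_jobs` workers (as the `gen_even_slices(Y.shape[0], n_jobs)` call in the traceback suggests) and that the blocks are float64.

```python
import struct

n_vectors = 424970
n_jobs = 80

# Assumption: each worker computes all rows against an even slice of columns.
columns_per_job = -(-n_vectors // n_jobs)  # ceiling division
bytes_per_job = n_vectors * columns_per_job * 8  # float64 block from one worker

int32_max = 2**31 - 1
print(bytes_per_job, bytes_per_job > int32_max)

# Packing that length as a signed 32-bit integer fails with a struct.error.
try:
    struct.pack("!i", bytes_per_job)
    header_error = None
except struct.error as exc:
    header_error = str(exc)
print(header_error)
```

Under these assumptions each block is roughly 18 GB, several times over the 2 GiB limit. That would make the error a message-size limit rather than a direct out-of-memory condition.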

For more detail, here is my job script:

#!/bin/bash
#SBATCH --job-name=cosine
#SBATCH --partition=XXXX
#SBATCH --nodes=1
#SBATCH --cpus-per-task=92
#SBATCH --mem=3000GB

module load python/3.7.0

python3 05_05_cosine_similarity_n_jobs.py

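I have a few questions:

1) Does this error mean I am still running out of memory?

2) If the answer to 1) is yes, is there a very simple adjustment I can make so that the script runs?

3) If there is no simple adjustment, what is the broad approach to designing a solution?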
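On question 3, one possible direction is to compute the matrix one block of rows at a time and write each block straight into a disk-backed array, so that no large result has to pass between processes. The following is a minimal single-process sketch using only numpy, not a tuned solution. The function name `cosine_similarity_in_blocks`, the file name `similarity.dat`, the block size, and the float32 output are illustrative assumptions, and small random data stands in for `vectors.npy`.

```python
import os
import tempfile

import numpy as np


def cosine_similarity_in_blocks(vectors, out, block_rows=1024):
    """Fill `out` (n x n) with cosine similarities, one row block at a time."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by 0
    unit = vectors / norms
    n = unit.shape[0]
    for start in range(0, n, block_rows):
        stop = min(start + block_rows, n)
        # Only a (block_rows x n) slab is computed at a time.
        out[start:stop] = unit[start:stop] @ unit.T
    return out


# Small stand-in data; the real input would be np.load("vectors.npy").
rng = np.random.default_rng(0)
demo_vectors = rng.standard_normal((200, 16))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "similarity.dat")  # hypothetical output file
    result = np.memmap(path, dtype=np.float32, mode="w+", shape=(200, 200))
    cosine_similarity_in_blocks(demo_vectors, result, block_rows=64)
    result.flush()
    demo_similarity = np.array(result)  # copy out before the temp file is removed
    del result
```

Storing float32 instead of float64 would halve the full-size output to roughly 0.72 TB. Whether that precision is acceptable depends on how the similarities are used afterwards.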

0 answers:

No answers