Question

I am using cosine_similarity on matrices and wanted to know how much memory it needs. So I created a small snippet:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
n = 10000
mat = np.random.random((n, n))
sim = cosine_similarity(mat)

As n grows, the matrix of course gets bigger. I expected the matrix to take n**2 * 4 bytes, which means:

n = 10,000: 400 MB
n = 15,000: 900 MB
n = 20,000: 1.6 GB
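The estimate above assumes 4 bytes per element. A quick way to check that assumption is to ask NumPy for the element size directly. The sketch below does this on a small array (the helper name expected_bytes is only illustrative); np.random.random returns float64 values, so the element size it reports is worth comparing against the 4 bytes used in the list.

```python
import numpy as np

def expected_bytes(n, itemsize):
    """Size in bytes of a dense n x n array with the given element size."""
    return n * n * itemsize

# A small array is enough here: the dtype does not depend on n.
small = np.random.random((100, 100))
print(small.dtype, small.itemsize)  # element type and bytes per element
print(small.nbytes == expected_bytes(100, small.itemsize))

# Redo the size estimate with the element size NumPy actually uses.
for n in (10_000, 15_000, 20_000):
    size = expected_bytes(n, small.itemsize)
    print(f"n = {n}: {size / 1e6:.0f} MB = {size / 2**20:.1f} MiB")
```

Note that memory_profiler reports MiB (2**20 bytes), so its numbers should be compared against the MiB column rather than the MB one.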

I observe much higher memory usage than that. My system has 16 GB of RAM and it crashes for n = 20,000. Why is that?

What I've tried

I have seen How do I profile memory usage in Python?, so I installed memory-profiler and executed

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

@profile
def cos(n):
    mat = np.random.random((n, n))
    sim = cosine_similarity(mat)
    return sim

sim = cos(n=10000)

with

python3 -m memory_profiler memory_usage_cosine_similarity.py

and got

Line #    Mem usage    Increment   Line Contents
================================================
     4   62.301 MiB   62.301 MiB   @profile
     5                             def cos(n):
     6  825.367 MiB  763.066 MiB       mat = np.random.random((n, n))
     7 1611.922 MiB  786.555 MiB       sim = cosine_similarity(mat)
     8 1611.922 MiB    0.000 MiB       return sim

But I am confused by most of it:

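Why is @profile already at 62.301 MiB (that much)?
Why is mat at 825 MiB instead of 400 MB?
Why do sim and mat have different sizes?
Why does htop show an increase from 3.1 GB to 5.5 GB (2.4 GB), while the profiler says only 1.6 GB is needed?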

Ignoring that, this is what happens when I increase n:

n                cosine_similarity
1000              11.684 MiB
2000 (x2)         37.547 MiB (x 3.2)
4000 (x4)        134.027 MiB (x11.5)
8000 (x8)        508.316 MiB (x43.5)

So cosine_similarity shows roughly O(n**1.8) memory behaviour.
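The exponent can be estimated from the table above with a least-squares fit on a log-log scale. This is only a sketch: the four measurements are copied from the table, and fit_exponent is a hypothetical helper name, not part of any library.

```python
import math

# (n, memory increment in MiB), copied from the n x n table above.
measurements = [(1000, 11.684), (2000, 37.547), (4000, 134.027), (8000, 508.316)]

def fit_exponent(points):
    """Least-squares slope of log(memory) against log(n)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(m) for _, m in points]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

exponent = fit_exponent(measurements)
print(f"fitted exponent: {exponent:.2f}")
```

With only four points, a fixed overhead at small n can pull the fitted slope below the asymptotic value, so the result is a rough indication rather than a precise growth rate.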

If I use n x 100 matrices instead of n x n matrices, I get similar numbers:

n                cosine_similarity
1000               9.512 MiB
2000 (x2)         33.543 MiB (x 3.5)
4000 (x4)        127.152 MiB (x13.4)
8000 (x8)        496.234 MiB (x52.2)
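One way to read these numbers: when called with a single argument, cosine_similarity returns an array of shape (n_samples, n_samples), regardless of the number of columns in the input. The sketch below compares the measurements from the table with the size of such a result alone. It assumes a dense float64 result and ignores any temporaries (for example a normalized copy of the input); result_mib is an illustrative helper name.

```python
def result_mib(n_samples, itemsize=8):
    """MiB needed for a dense (n_samples x n_samples) result, 8-byte floats assumed."""
    return n_samples * n_samples * itemsize / 2**20

# Measured increments in MiB, copied from the n x 100 table above.
measured = {1000: 9.512, 2000: 33.543, 4000: 127.152, 8000: 496.234}

for n, mib in measured.items():
    print(f"n = {n}: result alone ~{result_mib(n):.1f} MiB, measured {mib} MiB")
```

If the two columns stay close as n grows, the n x n result would account for most of the increment in this case.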

What is the memory footprint of sklearn.metrics.pairwise.cosine_similarity?
