Question

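I am learning to use CuPy, but I have run into something confusing. CuPy seems to perform well at the start of the program, then becomes much slower after it has been running for a while. Here is the code: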

import cupy as np
from line_profiler import LineProfiler

def test(ary):
    for i in range(1000):
        ary**6

def main():
    rand=np.random.rand(1024,1024)
    test(rand)
    test(rand)
    test(rand)
    test(rand)
    test(rand)
    test(rand)
    test(rand)

lp = LineProfiler()
lp_wrapper = lp(main)
lp_wrapper()
lp.print_stats()

Here are the timings:

Timer unit: 2.85103e-07 s

Total time: 16.3308 s
File: E:\Desktop\test.py
Function: main at line 8

Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
     8                                             def main():
     9         1    1528817.0  1528817.0      2.7      rand=np.random.rand(1024,1024)
    10         1     111014.0   111014.0      0.2      test(rand)
    11         1      94528.0    94528.0      0.2      test(rand)
    12         1      95636.0    95636.0      0.2      test(rand)
    13         1      94892.0    94892.0      0.2      test(rand)
    14         1    7728318.0  7728318.0     13.5      test(rand)
    15         1   23872383.0 23872383.0     41.7      test(rand)
    16         1   23754666.0 23754666.0     41.5      test(rand)

Once CuPy has performed about 5000 power operations, it becomes very slow.

I ran this code on Windows, with CUDA version 10.0.

I hope someone can answer this. Thank you very much!

Thank you for your answer! I printed CuPy's memory usage:

import cupy as np

def test(ary):
    # cupy is imported under the alias np, so the pools are reached through np
    mempool = np.get_default_memory_pool()
    pinned_mempool = np.get_default_pinned_memory_pool()
    for i in range(1000):
        ary**6
    print("used bytes: %s"%mempool.used_bytes())
    print("total bytes: %s\n"%mempool.total_bytes())

def main():
    rand=np.random.rand(1024,1024)
    test(rand)
    test(rand)
    test(rand)
    test(rand)
    test(rand)
    test(rand)
    test(rand)

main()

Here is the output:

used bytes: 8388608
total bytes: 16777216

used bytes: 8388608
total bytes: 16777216

used bytes: 8388608
total bytes: 16777216

used bytes: 8388608
total bytes: 16777216

used bytes: 8388608
total bytes: 16777216

used bytes: 8388608
total bytes: 16777216

used bytes: 8388608
total bytes: 16777216

GPU memory usage seems to stay constant across the iterations.

By the way, is there any way to avoid this slowdown?

Answer 1

This is an issue with the CUDA kernel queue.

See the following:

The short execution times observed in your code are misleading, because CuPy returns immediately as long as the queue is not full.
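To make the queue effect concrete, here is a toy model in plain Python. It is only a sketch under assumptions of my own: the function name `simulate_launch_returns`, the queue capacity, and the kernel time are all made up for illustration, and the real CUDA launch queue depth is an implementation detail that this does not try to reproduce.

```python
def simulate_launch_returns(num_calls, kernel_time, queue_capacity):
    """Return the host-side time at which each launch call returns.

    Toy model (hypothetical numbers): the device runs kernels one at a
    time, each taking `kernel_time`. A launch returns at once while fewer
    than `queue_capacity` kernels are unfinished; otherwise the host
    blocks until the oldest pending kernel has completed.
    """
    host_time = 0.0
    finish_times = []   # device completion time of every launched kernel
    return_times = []   # host time at which each launch call returned
    for i in range(num_calls):
        if i >= queue_capacity:
            # Queue is full: wait for kernel (i - queue_capacity) to finish.
            host_time = max(host_time, finish_times[i - queue_capacity])
        # The device starts this kernel once the previous one is done.
        previous_done = finish_times[-1] if finish_times else 0.0
        finish_times.append(max(host_time, previous_done) + kernel_time)
        return_times.append(host_time)
    return return_times


# The first 4 launches look free; after that the host pays for each kernel.
print(simulate_launch_returns(8, kernel_time=1.0, queue_capacity=4))
# prints [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
```

In this model the early calls appear to cost nothing only because their work is still pending on the device, which is the same shape as the profile in the question.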

The actual performance is what the last line shows.
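If you want every call to report its real cost, one option is to synchronize with the GPU before starting and before stopping the clock. The following is a minimal sketch, not the only way to do it: the helper name `timed` is my own, and it assumes the work runs on CuPy's default stream, which is what `cupy.cuda.Stream.null.synchronize()` waits for.

```python
import time


def timed(fn, synchronize, repeat=3):
    """Return the wall-clock seconds taken by each of `repeat` calls to fn().

    `synchronize` is called before the clock starts and again before it
    stops, so asynchronously queued work is included in the measurement.
    """
    durations = []
    for _ in range(repeat):
        synchronize()                  # drain work queued by earlier calls
        start = time.perf_counter()
        fn()
        synchronize()                  # wait for the work fn() enqueued
        durations.append(time.perf_counter() - start)
    return durations


def main():
    # Imported here so that `timed` itself can be used without a GPU.
    import cupy as cp

    ary = cp.random.rand(1024, 1024)

    def test():
        for _ in range(1000):
            ary ** 6

    # Assumption: all kernels are launched on the default (null) stream.
    sync = cp.cuda.Stream.null.synchronize
    for seconds in timed(test, sync, repeat=7):
        print("%.3f s" % seconds)
```

Calling `main()` on a machine with CuPy and a CUDA device should print seven timings that are close to one another, each reflecting the time the GPU actually needs for 1000 power operations.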

Note: this is not a memory allocation issue (as I first suggested in my original answer), but I am keeping the original answer here for the record.

Original (incorrect) answer

It may be due to reallocation.

When you import cupy, CuPy allocates "a certain amount" of GPU memory. Once CuPy has used all of that memory, it has to allocate more, which increases the execution time.

CuPy becomes slower as the number of iterations increases
