Why does this loop scale quadratically in time

Time: 2017-08-01 10:40:01

Tags: python loops numpy time time-complexity

I am using the code below to rearrange a matrix in numpy

t1=time.time()

df1=train1[1,1:52]
for i in xrange(40):
    for j in xrange(52,551):
        x=train1[i,(j-51):j]
        df1=np.vstack((df1,x))

t2=time.time()
t=t2-t1

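When the outer loop runs for 5 iterations [i in xrange(5), with j unchanged], it takes <1 second. For 10 iterations it takes about 4 seconds; for 20, about 18 seconds; for 40, about 85 seconds.

Can someone clarify why the loop scales quadratically in time even though we only increase the outer loop linearly?

Thanks

PS: The matrix I am using here is the training set from Kaggle's Wikipedia competition. You can download train_1.csv from the link; I read it into a pandas dataframe and then converted it to a numpy matrix (i.e. train1) using .to_matrix()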

2 Answers:

Answer 0 (score: 2)

The problem is that the call to vstack creates a copy of df1 on every iteration. Since the size of df1 grows linearly with the outer loop range, you get a quadratic run time.

Profiling the code shows that most of the time is spent in concatenate, which is called by vstack:

In [13]: cProfile.run('q.proc()')
         259486 function calls in 19.759 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001   19.759   19.759 <string>:1(<module>)
    39920    0.030    0.000    0.036    0.000 numeric.py:534(asanyarray)
    19960    0.031    0.000   15.037    0.001 shape_base.py:182(vstack)
    19960    0.020    0.000    0.121    0.000 shape_base.py:237(<listcomp>)
    39920    0.057    0.000    0.101    0.000 shape_base.py:63(atleast_2d)
        1    4.720    4.720   19.758   19.758 temp.py:6(proc)
        1    0.000    0.000   19.759   19.759 {built-in method builtins.exec}
    39920    0.003    0.000    0.003    0.000 {built-in method builtins.len}
    39920    0.006    0.000    0.006    0.000 {built-in method numpy.core.multiarray.array}
    19960   14.886    0.001   14.886    0.001 {built-in method numpy.core.multiarray.concatenate}
        2    0.000    0.000    0.000    0.000 {built-in method time.time}
    39920    0.005    0.000    0.005    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

You can instead create a list of the x arrays and concatenate them once after the loop.

Edit: I defined train1 myself to produce the profile above.
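The copying argument can be checked with plain arithmetic, without timing anything. The sketch below is only an illustration: the helper rows_copied is a hypothetical name, and it assumes the k-th vstack call builds a fresh array of k + 1 rows (the rows accumulated so far plus the new one), as in the question's loop.

```python
# Count the rows written by repeated np.vstack calls in the question's loop.
# Assumption: the k-th call (1-based) allocates a new array of k + 1 rows.
INNER = 499  # iterations of the inner loop, range(52, 551)


def rows_copied(n_outer):
    """Total rows written by all vstack calls for n_outer outer iterations."""
    calls = n_outer * INNER
    return sum(k + 1 for k in range(1, calls + 1))


counts = {n: rows_copied(n) for n in (5, 10, 20, 40)}
for n in (10, 20, 40):
    # Ratio of work when the outer range doubles from n // 2 to n.
    print(n, counts[n] / counts[n // 2])
```

Each doubling of the outer range multiplies the copied rows by a factor close to 4, which is roughly in line with the timings reported in the question (about 4, 18 and 85 seconds).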

Answer 1 (score: 1)

I just ran your code with more timing checks, using a randomly generated train1 matrix:

import time
import numpy as np

t_total=time.time()
train1=np.random.random((20, 550))
df1=train1[1,1:52]
for i in range(5):
    t1 = time.time()
    tj = []
    for j in range(52,551):
        t2 = time.time()
        x=train1[i,(j-51):j]
        df1=np.concatenate((df1,x),axis=0)
        tj.append(time.time()-t2) 
    print("Time to loop on j:", time.time()-t1)
    print("Average time for each j:", np.mean(tj))

print("Total time:", time.time()-t_total)

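When I run it, I get the following output, which shows that each pass of the outer loop clearly takes longer than the previous one.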

Time to loop on j: 0.009780406951904297
Average time for each j: 1.9157577851e-05
Time to loop on j: 0.02693343162536621
Average time for each j: 5.33469932113e-05
Time to loop on j: 0.06705927848815918
Average time for each j: 0.000133752822876
Time to loop on j: 0.08919048309326172
Average time for each j: 0.000178138813179
Time to loop on j: 0.11366486549377441
Average time for each j: 0.000227188060661
Total time: 0.3072977066040039

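My guess: np.concatenate (like np.vstack) simply takes more time as its input array grows, because every call copies everything accumulated so far, and that is what makes the total execution time grow quadratically rather than linearly. I could not find a NumPy equivalent that handles this, but a working solution is to store all the arrays to be stacked in a list and then compute the stack once at the end: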

import time
import numpy as np

t_total=time.time()
train1=np.random.random((20, 550))
df1=[train1[1,1:52]]
for i in range(5):
    t1 = time.time()
    tj = []
    for j in range(52,551):
        t2 = time.time()
        x=train1[i,(j-51):j]
        df1.append(x)
        tj.append(time.time()-t2) 
    print("Time to loop on j:", time.time()-t1)
    print("Average time for each j:", np.mean(tj))
df1 = np.vstack(df1)
print("Total time:", time.time()-t_total)

This gives me the following run times:

Time to loop on j: 0.0005383491516113281
Average time for each j: 7.99347260194e-07
Time to loop on j: 0.0005192756652832031
Average time for each j: 7.58734876981e-07
Time to loop on j: 0.0005254745483398438
Average time for each j: 7.73546452035e-07
Time to loop on j: 0.0005245208740234375
Average time for each j: 7.73546452035e-07
Time to loop on j: 0.0005295276641845703
Average time for each j: 7.80235550447e-07
Total time: 0.008821249008178711

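It seems that stacking many small arrays once is much cheaper than repeatedly restacking an ever-growing array. Appending a fixed-size object to a list has a roughly constant cost, regardless of the size of the list.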
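Because the final number of rows is known in advance here (1 + n_outer * 499), another option is to preallocate the result and fill it row by row. This is not taken from either answer; it is a minimal sketch under that assumption, with hypothetical function names windows_prealloc and windows_list, checked against the list-then-vstack approach on random data.

```python
import numpy as np

WIDTH = 51                  # length of each slice train1[i, j-51:j]
J_START, J_STOP = 52, 551   # inner loop range from the question


def windows_prealloc(train1, n_outer):
    """Fill a preallocated array instead of growing one with np.vstack."""
    n_inner = J_STOP - J_START
    out = np.empty((1 + n_outer * n_inner, WIDTH), dtype=train1.dtype)
    out[0] = train1[1, 1:52]  # same seed row as the question's code
    row = 1
    for i in range(n_outer):
        for j in range(J_START, J_STOP):
            out[row] = train1[i, j - WIDTH:j]  # each row is written exactly once
            row += 1
    return out


def windows_list(train1, n_outer):
    """Reference version: collect the slices in a list, stack once at the end."""
    rows = [train1[1, 1:52]]
    for i in range(n_outer):
        for j in range(J_START, J_STOP):
            rows.append(train1[i, j - WIDTH:j])
    return np.vstack(rows)


rng = np.random.default_rng(0)
train1 = rng.random((20, 550))
a = windows_prealloc(train1, 5)
b = windows_list(train1, 5)
print(a.shape, np.array_equal(a, b))
```

Both versions write every row once, so their cost grows linearly with the outer loop range.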