I am using the code below to rearrange a matrix in numpy:
t1=time.time()
df1=train1[1,1:52]
for i in xrange(40):
    for j in xrange(52,551):
        x=train1[i,(j-51):j]
        df1=np.vstack((df1,x))
t2=time.time()
t=t2-t1
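When the outer loop runs for 5 iterations [i in xrange(5), with j unchanged], it takes <1 second. With 10 iterations it takes about 4 seconds; with 20, about 18 seconds; with 40, about 85 seconds.
Can someone clarify why the run time scales quadratically even though the outer loop only increases linearly?
Thanks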
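PS: The matrix I am using here is the training set from Kaggle's Wikipedia competition. You can download train_1.csv from the link; I read it into a pandas dataframe and then converted it to a numpy matrix (i.e. train1) using .as_matrix().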
Answer 0: (score: 2)
The problem is that the call to vstack creates a copy of df1 in every iteration. Since the size of df1 grows linearly with the outer loop range, and every call copies all the rows accumulated so far, you get quadratic run time.
Profiling the code shows that most of the time is spent in concatenate, which is called by vstack:
In [13]: cProfile.run('q.proc()')
         259486 function calls in 19.759 seconds
   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001   19.759   19.759 <string>:1(<module>)
    39920    0.030    0.000    0.036    0.000 numeric.py:534(asanyarray)
    19960    0.031    0.000   15.037    0.001 shape_base.py:182(vstack)
    19960    0.020    0.000    0.121    0.000 shape_base.py:237(<listcomp>)
    39920    0.057    0.000    0.101    0.000 shape_base.py:63(atleast_2d)
        1    4.720    4.720   19.758   19.758 temp.py:6(proc)
        1    0.000    0.000   19.759   19.759 {built-in method builtins.exec}
    39920    0.003    0.000    0.003    0.000 {built-in method builtins.len}
    39920    0.006    0.000    0.006    0.000 {built-in method numpy.core.multiarray.array}
    19960   14.886    0.001   14.886    0.001 {built-in method numpy.core.multiarray.concatenate}
        2    0.000    0.000    0.000    0.000 {built-in method time.time}
    39920    0.005    0.000    0.005    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
You could instead build a list of the x slices and concatenate them once after the loop.
Edit: I defined train1 as [...].
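The quadratic growth described in this answer can be checked with a quick count. The sketch below does not touch numpy at all; it only tallies how many rows get copied if every stacking call rebuilds the whole accumulated array. The function name rows_copied and its parameters are made up for this illustration, and the 499 inner iterations come from j running over 52..550 in the question.

```python
def rows_copied(outer_iterations, inner_iterations=499):
    """Count rows copied when each stacking call re-copies the whole array.

    Hypothetical helper for illustration only; it models the copying,
    it does not measure numpy itself.
    """
    total = 0
    rows = 1  # df1 starts out as a single row
    for _ in range(outer_iterations * inner_iterations):
        rows += 1      # the new array has one more row than the old one
        total += rows  # and every one of its rows is written afresh
    return total


for n in (5, 10, 20, 40):
    print(n, rows_copied(n))
```

Doubling the outer loop range roughly quadruples the count, which is in line with the 4 s, 18 s, 85 s progression reported in the question.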
Answer 1: (score: 1)
I just checked this with finer-grained timing, using a randomly generated train1 matrix:
import time
import numpy as np
t_total=time.time()
train1=np.random.random((20, 550))
df1=train1[1,1:52]
for i in range(5):
    t1 = time.time()
    tj = []
    for j in range(52,551):
        t2 = time.time()
        x=train1[i,(j-51):j]
        # Note: concatenating 1-D arrays gives one flat 1-D array (unlike vstack),
        # but df1 is still copied in full on every call, so the timing trend is the same.
        df1=np.concatenate((df1,x),axis=0)
        tj.append(time.time()-t2)
    print("Time to loop on j:", time.time()-t1)
    print("Average time for each j:", np.mean(tj))
print("Total time:", time.time()-t_total)
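When I run it I get the following output, which shows that each pass of the outer loop clearly takes longer than the one before.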
Time to loop on j: 0.009780406951904297
Average time for each j: 1.9157577851e-05
Time to loop on j: 0.02693343162536621
Average time for each j: 5.33469932113e-05
Time to loop on j: 0.06705927848815918
Average time for each j: 0.000133752822876
Time to loop on j: 0.08919048309326172
Average time for each j: 0.000178138813179
Time to loop on j: 0.11366486549377441
Average time for each j: 0.000227188060661
Total time: 0.3072977066040039
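My guess: np.vstack (like the np.concatenate call used above) takes longer as its input grows, because every call copies the whole accumulated array. The cost of a single call therefore grows linearly with the amount of data already stacked, and the total run time grows quadratically rather than exponentially. I couldn't find a built-in numpy way around this, but a working solution is to store all the arrays to be stacked in a list and compute the stack once at the end: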
import time
import numpy as np
t_total=time.time()
train1=np.random.random((20, 550))
df1=[train1[1,1:52]]
for i in range(5):
    t1 = time.time()
    tj = []
    for j in range(52,551):
        t2 = time.time()
        x=train1[i,(j-51):j]
        df1.append(x)
        tj.append(time.time()-t2)
    print("Time to loop on j:", time.time()-t1)
    print("Average time for each j:", np.mean(tj))
df1 = np.vstack(df1)
print("Total time:", time.time()-t_total)
This gives me the following run times:
Time to loop on j: 0.0005383491516113281
Average time for each j: 7.99347260194e-07
Time to loop on j: 0.0005192756652832031
Average time for each j: 7.58734876981e-07
Time to loop on j: 0.0005254745483398438
Average time for each j: 7.73546452035e-07
Time to loop on j: 0.0005245208740234375
Average time for each j: 7.73546452035e-07
Time to loop on j: 0.0005295276641845703
Average time for each j: 7.80235550447e-07
Total time: 0.008821249008178711
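It seems that stacking many small arrays in a single call is much cheaper than repeatedly stacking onto an ever-larger array: the one final np.vstack copies each row only once. Appending a fixed-size object to a list, meanwhile, has a roughly constant cost regardless of how long the list already is.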
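Since the number of rows is known before the loops start, another option is to allocate the result once and write each slice into place. This is a different technique from the list approach above, not something either answer proposed; the function name stack_windows_prealloc is made up for this sketch, and it assumes the same 550-column random matrix and the same j range (52..550) as the test code in this answer.

```python
import numpy as np


def stack_windows_prealloc(train1, n_rows):
    """Build the same rows as the loops above in one preallocated array.

    Hypothetical helper for illustration; assumes train1 has at least
    550 columns so that every slice below is 51 elements wide.
    """
    j_values = range(52, 551)  # same inner range as the original loops
    out = np.empty((1 + n_rows * len(j_values), 51), dtype=train1.dtype)
    out[0] = train1[1, 1:52]   # the initial df1 row
    k = 1
    for i in range(n_rows):
        for j in j_values:
            out[k] = train1[i, (j - 51):j]  # in-place write, no reallocation
            k += 1
    return out


train1 = np.random.random((20, 550))
df1 = stack_windows_prealloc(train1, 5)
print(df1.shape)  # (2496, 51)
```

Each row is written exactly once, so the work grows linearly with the outer loop range, just as with the list-then-vstack version.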