Speed of writing a numpy array to a text file

Time: 2018-12-17 18:12:53

Tags: python performance numpy

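I need to write a very "tall" two-column array to a text file, and it is very slow. I found that if I reshape the array into a wider one, writing is much faster. For example: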

import time
import numpy as np
dataMat1 = np.random.rand(1000,1000)
dataMat2 = np.random.rand(2,500000)
dataMat3 = np.random.rand(500000,2)
start = time.perf_counter()
with open('test1.txt','w') as f:
    np.savetxt(f,dataMat1,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('test2.txt','w') as f:
    np.savetxt(f,dataMat2,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('test3.txt','w') as f:
    np.savetxt(f,dataMat3,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

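Given that the three data matrices contain the same number of elements, why does the last one take so much longer than the other two? Is there any way to speed up writing a "tall" data array?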

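2 answers:

Answer 0: (score: 3)

The code for savetxt is written in Python and is readable. Basically, it does a formatted write for each row of the array. In effect, it does: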

for row in arr:
    # one string interpolation and one write per row
    f.write(fmt % tuple(row) + '\n')

where fmt is derived from your fmt argument and the shape of the array, e.g.

'%g %g %g ...'

So it performs one file write for each row of the array. Formatting each row also takes some time, but that is done in memory with Python code.
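To make the description above concrete, here is a minimal sketch of the per-row loop. It is a simplification, not the actual NumPy source: the helper name `savetxt_sketch` is hypothetical, and it assumes a 2-D real-valued array with a single format specifier applied to every column.

```python
import io
import numpy as np

def savetxt_sketch(fh, arr, fmt='%g', delimiter=' ', newline='\n'):
    """Simplified illustration of a savetxt-style row loop (hypothetical helper)."""
    ncol = arr.shape[1]
    # Build one format string per row, e.g. '%g %g' for two columns.
    row_fmt = delimiter.join([fmt] * ncol)
    for row in arr:
        # One string interpolation and one write call per row.
        fh.write(row_fmt % tuple(row) + newline)

arr = np.array([[1.0, 2.5], [3.0, 4.25]])
buf = io.StringIO()
savetxt_sketch(buf, arr)
sketch_output = buf.getvalue()
```

With this structure, a 500000 x 2 array goes through the loop body 500000 times, while a 2 x 500000 array goes through it only twice, which is consistent with the timing pattern in the question.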

I expect loadtxt/genfromtxt will show the same time pattern: reading many rows takes longer.

pandas has a faster csv load. I haven't seen any discussion of its write speed.

Answer 1: (score: 3)

As hpaulj pointed out, savetxt is looping through the rows of X and formatting each row individually:

for row in X:
    try:
        v = format % tuple(row) + newline
    except TypeError:
        raise TypeError("Mismatch between array dtype ('%s') and "
                        "format specifier ('%s')"
                        % (str(X.dtype), format))
    fh.write(v)

I think the main time-killer here is all the string interpolation calls. If we pack all the string interpolation into one call, things go much faster:

with open('/tmp/test4.txt','w') as f:
    fmt = ' '.join(['%g']*dataMat3.shape[1])
    fmt = '\n'.join([fmt]*dataMat3.shape[0])
    data = fmt % tuple(dataMat3.ravel())
    f.write(data)

import io
import time
import numpy as np

dataMat1 = np.random.rand(1000,1000)
dataMat2 = np.random.rand(2,500000)
dataMat3 = np.random.rand(500000,2)
start = time.perf_counter()
with open('/tmp/test1.txt','w') as f:
    np.savetxt(f,dataMat1,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('/tmp/test2.txt','w') as f:
    np.savetxt(f,dataMat2,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('/tmp/test3.txt','w') as f:
    np.savetxt(f,dataMat3,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('/tmp/test4.txt','w') as f:
    fmt = ' '.join(['%g']*dataMat3.shape[1])
    fmt = '\n'.join([fmt]*dataMat3.shape[0])
    data = fmt % tuple(dataMat3.ravel())        
    f.write(data)
end = time.perf_counter()
print(end-start)

reports (times in seconds):

0.1604848340011813
0.17416274400056864
0.6634929459996783
0.16207673999997496
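The single-interpolation idea can be wrapped in a small reusable helper. The following is only a sketch under stated assumptions: the name `fast_savetxt` is hypothetical, it handles only 2-D real-valued arrays with one format specifier for all columns, and it builds the entire output string in memory, which may be a concern for very large arrays. Unlike the snippet above, it appends a trailing newline so that its output matches what np.savetxt writes.

```python
import io
import numpy as np

def fast_savetxt(fh, arr, fmt='%g', delimiter=' ', newline='\n'):
    """Write a 2-D array with a single string interpolation (hypothetical helper)."""
    arr = np.asarray(arr)
    nrow, ncol = arr.shape
    if nrow == 0:
        return  # nothing to write
    # Format for one row, e.g. '%g %g'.
    row_fmt = delimiter.join([fmt] * ncol)
    # Format for the whole array: one row format per line, newline-terminated.
    full_fmt = newline.join([row_fmt] * nrow) + newline
    # ravel() yields the elements in row-major order, matching full_fmt.
    fh.write(full_fmt % tuple(arr.ravel()))

# Compare against np.savetxt on a small "tall" array.
data = np.random.rand(1000, 2)
fast_buf = io.StringIO()
fast_savetxt(fast_buf, data)
ref_buf = io.StringIO()
np.savetxt(ref_buf, data, fmt='%g', delimiter=' ')
outputs_match = fast_buf.getvalue() == ref_buf.getvalue()
```

The comparison at the end is a quick sanity check that the helper produces the same text as np.savetxt for this input. It says nothing about speed, which should be measured separately, as in the benchmark above.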