Question

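I have a function that reads binary data from files stored on a remote server. The first time I run it on a given file it is very slow, while subsequent reads of the same file are fast enough.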

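This is a mystery to me, because the file is opened and closed inside the sub-function (so performance should be consistent and should not depend on whether I have read that particular file before).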

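The data are stored in binary files of ~300 MB, and I need to read every Nth number. For example, I might read the 1st number, the 11th number, the 21st number, and so on until the end of the file. The file can be described as containing an ncol x nrow matrix, where the first ncol numbers are row 1, the next ncol numbers are row 2, and so on. I want to read one column, i.e. one number per row.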
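To make the layout concrete, here is a minimal sketch of the byte offsets involved. The helper name `column_offsets` is hypothetical, and it assumes 4-byte values and 1-based column numbers, as in the function further down.

```python
def column_offsets(col, ncol, nrow, num_bytes=4):
    """Return the byte offset of each value in 1-based column `col`.

    Assumes row-major storage: row i occupies ncol consecutive values.
    """
    return [(col - 1 + i * ncol) * num_bytes for i in range(nrow)]


# Column 1 of a 10-column file: the 1st, 11th and 21st numbers.
print(column_offsets(col=1, ncol=10, nrow=3))  # [0, 40, 80]
```

Each offset is one seek target, so reading a column touches nrow scattered 4-byte locations spread over the whole file.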

I use f.seek() to set the file pointer to the correct position in the file.

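The following code takes ~10 s on the first run and 0.2 s on subsequent runs. Note that this timing covers only reading the data; opening the file is fast. Why does the performance change when the file is opened and closed on every call?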

import struct
import time

def ReadBinaryFile(filename, col, ncol):
    """
    Args:
        filename: binary file with float numbers (4-byte) stored in columns
        col: column to read (1-based)
        ncol: number of columns in file (needed to separate data)

    Description:
        -reads 4-byte numeric data from specified column and converts to float

    Returns:
        X: list with numeric data
    """
    #--------hard-coded variables:
    numBytes = 4 #4-byte numbers

    #---------open file
    f = open(filename, 'rb') #open file
    f.seek(0, 2) #put pointer to end of file
    nmax = f.tell() // numBytes #total length of data
    nrow = nmax // ncol #number of rows

    X = [0] * nrow #allocate memory
    t0 = time.time() #start timer

    for i in xrange(nrow):
        f.seek((col - 1 + i * ncol) * numBytes) #jump to the value in row i
        bindata = f.read(numBytes)
        X[i] = float(struct.unpack('f', bindata)[0])

    t1 = time.time() #end timer
    print('Elapsed time reading binary data: ' + str(t1 - t0))

    f.close()

    return X

Update: I rewrote the loop and replaced f.seek() with f.read() to move forward through the file.

  f.seek((col-1)*numBytes) #set initial file pointer
  for i in xrange(nrow):
    bindata = f.read(numBytes) #read desired data
    X[i] = float(struct.unpack('f',bindata)[0])
    f.read(numBytes*(ncol-1)) #move to next row by reading data.

Original code, using f.seek() to move the file pointer, reading the same file twice: first read = 10.2 s, subsequent read = 0.16 s.

Updated code, using f.read() to move the file pointer, reading the same file twice: first read = 1.67 s, subsequent read = 0.25 s.

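So on the first read of a file the updated code is roughly 6 times faster (10.2 s vs 1.67 s), but on the next read it is roughly 1.5 times slower (0.25 s vs 0.16 s). I restarted the kernel between each test and tested several times to verify the results.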
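For comparison, a third variant would be to pull the whole file in with a single bulk read and slice out the column in memory. The sketch below is untested against the remote server and has not been timed; the name `read_column_bulk` is hypothetical. It assumes the file fits in memory (plausible at ~300 MB), that the values are native-endian 4-byte floats as with struct's 'f', and it is written for Python 3.

```python
import array
import os


def read_column_bulk(filename, col, ncol, num_bytes=4):
    """Read the 1-based column `col` using one sequential read of the file."""
    # Number of complete rows in the file; trailing partial rows are ignored.
    nrow = os.path.getsize(filename) // (num_bytes * ncol)
    values = array.array('f')  # native 4-byte float on common platforms
    with open(filename, 'rb') as f:
        values.fromfile(f, nrow * ncol)  # one bulk read of all complete rows
    # Every ncol-th value, starting at the requested column.
    return list(values[col - 1::ncol])
```

This trades memory for fewer I/O calls: instead of nrow small reads, there is one large sequential read followed by an in-memory slice.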

This is frustrating. I am looking for ways to speed up the code. All ideas are appreciated!

python f.read() and f.seek() are slow when reading data over a server

0 answers: