Question

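I have a sensor unit that generates data in large binary files. File sizes can run into several tens of gigabytes. I need to:

Read the data.
Process it to extract the information I need.
Display/visualize the data.

The data in the binary file is stored as single-precision floats, i.e. numpy.float32.

I have written code that works well. I am now looking to optimize it for time. I have found that reading the binary data takes a very long time. Here is what I have right now: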

import numpy as np

def get_data(n):
    '''
    Function to get relevant trace data from the data file.
    Usage :
        get_data(n)
        where n is an integer containing the relevant trace number to be read
    Return :
        data_array : Python list containing single wavelength data.
    '''
    with open(data_file, 'rb') as fid:
        data_array = list(np.fromfile(fid, np.float32)[n*no_of_points_per_trace:(no_of_points_per_trace*(n+1))])
    return data_array

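This lets me iterate over values of n and obtain different traces, i.e. chunks of data. As the name suggests, the variable no_of_points_per_trace contains the number of points in each trace. I obtain it from a separate .info file.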
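
To make the layout concrete, here is a minimal sketch of the arithmetic it implies, assuming the file is nothing but back-to-back float32 traces with no header. The helper name trace_layout and the throwaway demo file are made up for illustration.

```python
import os
import tempfile

import numpy as np


def trace_layout(path, points_per_trace, dtype=np.float32):
    """Return (bytes_per_trace, n_traces) for a flat file of fixed-length traces."""
    bytes_per_trace = np.dtype(dtype).itemsize * points_per_trace
    # Integer division ignores a trailing partial trace, if there is one.
    n_traces = os.path.getsize(path) // bytes_per_trace
    return bytes_per_trace, n_traces


# Small demonstration on a throwaway file: 3 traces of 5 points each.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    np.arange(15, dtype=np.float32).tofile(demo_path)
    bytes_per_trace, n_traces = trace_layout(demo_path, points_per_trace=5)
```

Trace n then starts at byte n * bytes_per_trace, and valid values of n run from 0 to n_traces - 1.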

Is there an optimal way to do this?

Answer 1

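Right now you are reading the whole file into memory every time you call np.fromfile(fid, np.float32). If the file fits in memory and you want to access a significant number of traces (that is, you call your function with many different values of n), the only big speedup is to avoid reading it more than once. So you might want to read the whole file once and then have your function just index into it: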

# just once:
with open(data_file, 'rb') as fid:
    alldata = list(np.fromfile(fid, np.float32))

# then use this function
def get_data(alldata, n):
    return alldata[n*no_of_points_per_trace:(no_of_points_per_trace*(n+1))]
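
If the file does not fit in memory, one possible middle ground (a different technique from the snippet above, not something the question uses) is numpy.memmap, which maps the file and only loads the parts you index. This is a rough sketch assuming a headerless file of float32 traces; open_traces and the demo file are hypothetical names.

```python
import os
import tempfile

import numpy as np


def open_traces(path, points_per_trace, dtype=np.float32):
    """Map the file as a read-only 2-D array of shape (n_traces, points_per_trace)."""
    itemsize = np.dtype(dtype).itemsize
    n_traces = os.path.getsize(path) // (itemsize * points_per_trace)
    # Nothing is read here; data is pulled from disk when rows are accessed.
    return np.memmap(path, dtype=dtype, mode="r", shape=(n_traces, points_per_trace))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    np.arange(12, dtype=np.float32).tofile(demo_path)  # 3 traces of 4 points

    traces = open_traces(demo_path, points_per_trace=4)
    n_traces = traces.shape[0]
    second_trace = np.array(traces[1])  # copy one trace into ordinary memory
    del traces  # release the mapping before the file is removed
```

With this layout, traces[n] plays the role of get_data(n), and whether it beats an explicit seek will depend on your access pattern and operating system.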

Now, if you find that you only need one or two traces from the big file, you can seek into the file and read just the part you want:

def get_data(n):
    dtype = np.float32
    with open(data_file, 'rb') as fid:
        fid.seek(dtype().itemsize*no_of_points_per_trace*n)
        data_array = np.fromfile(fid, dtype, count=no_of_points_per_trace)
    return data_array

You'll notice I have skipped the conversion to a list. That is a slow step and is probably not required for your workflow.
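
As a quick sanity check of the partial read, here is a small sketch that compares it against slicing a full read of a throwaway file. It assumes NumPy 1.17 or newer, where np.fromfile accepts an offset argument in bytes (equivalent to the explicit seek); read_trace and the demo file are illustrative names only.

```python
import os
import tempfile

import numpy as np


def read_trace(path, n, points_per_trace, dtype=np.float32):
    """Read trace n only, skipping the preceding bytes via the offset argument."""
    itemsize = np.dtype(dtype).itemsize
    return np.fromfile(path, dtype=dtype, count=points_per_trace,
                       offset=n * points_per_trace * itemsize)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    np.arange(12, dtype=np.float32).tofile(demo_path)  # 3 traces of 4 points

    whole = np.fromfile(demo_path, dtype=np.float32)   # slow path: read everything
    trace = read_trace(demo_path, n=2, points_per_trace=4)

same = np.array_equal(trace, whole[8:12])  # partial read matches the full-file slice
as_list = trace.tolist()                   # convert only if a list is really needed
```

If some downstream code really does require a Python list, calling tolist() on the single trace is far cheaper than converting the whole file.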

What is the fastest way to read a specific chunk of data from a large binary file in Python
