I have some very large datasets stored in binary files on the hard disk. Here is an example of the file structure:

File Header
149 Byte ASCII Header

Record Start
4 Byte Int - Record Timestamp

Sample Start
2 Byte Int - Data Stream 1 Sample
2 Byte Int - Data Stream 2 Sample
2 Byte Int - Data Stream 3 Sample
2 Byte Int - Data Stream 4 Sample
Sample End
There are 122,880 samples per record and 713 records per file, giving a total size of 700,910,521 bytes. The sample rate and number of records sometimes vary, so I have to code for detecting the counts in each file.
The code I currently use to import this data into arrays is as follows:
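(As a quick sanity check of those figures, a sketch based only on the numbers quoted above: each record is one 4-byte timestamp plus 122,880 samples of four 2-byte streams.)

samples_per_record = 122880
record_size = 4 + samples_per_record * 4 * 2   # timestamp + 4 streams of 2-byte ints = 983,044 bytes
file_size = 149 + 713 * record_size            # ASCII header + 713 records
print file_size                                # 700910521, matching the figure above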
from time import clock
from numpy import zeros , int16 , int32 , hstack , array , savez
from struct import unpack
from os.path import getsize
start_time = clock()
file_size = getsize(input_file)
with open(input_file,'rb') as openfile:
    input_data = openfile.read()
header = input_data[:149]
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2
time_series = zeros(0,dtype=int32)
t_series = zeros(0,dtype=int16)
x_series = zeros(0,dtype=int16)
y_series = zeros(0,dtype=int16)
z_series = zeros(0,dtype=int16)
for record in xrange(number_of_records):
    time_stamp = array( unpack( '<l' , input_data[ 149 + (record * record_size) : 149 + (record * record_size) + 4 ] ) , dtype = int32 )
    unpacked_record = unpack( '<' + str(sample_rate * 4) + 'h' , input_data[ 149 + (record * record_size) + 4 : 149 + ( (record + 1) * record_size ) ] )
    record_t = zeros(sample_rate , dtype=int16)
    record_x = zeros(sample_rate , dtype=int16)
    record_y = zeros(sample_rate , dtype=int16)
    record_z = zeros(sample_rate , dtype=int16)
    for sample in xrange(sample_rate):
        record_t[sample] = unpacked_record[ ( sample * 4 ) + 0 ]
        record_x[sample] = unpacked_record[ ( sample * 4 ) + 1 ]
        record_y[sample] = unpacked_record[ ( sample * 4 ) + 2 ]
        record_z[sample] = unpacked_record[ ( sample * 4 ) + 3 ]
    time_series = hstack ( ( time_series , time_stamp ) )
    t_series = hstack ( ( t_series , record_t ) )
    x_series = hstack ( ( x_series , record_x ) )
    y_series = hstack ( ( y_series , record_y ) )
    z_series = hstack ( ( z_series , record_z ) )
savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, time=time_series)
end_time = clock()
print 'Total Time',end_time - start_time,'seconds'
This currently takes about 250 seconds per 700 MB file, which seems very high to me. Is there a more efficient way to do this?
Using the numpy fromfile method with a custom dtype brought the runtime down to 9 seconds, 27x faster than the original code above. The final code is below.
from numpy import savez, dtype , fromfile
from os.path import getsize
from time import clock
start_time = clock()
file_size = getsize(input_file)
openfile = open(input_file,'rb')
header = openfile.read(149)
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2
record_dtype = dtype( [ ( 'timestamp' , '<i4' ) , ( 'samples' , '<i2' , ( sample_rate , 4 ) ) ] )
data = fromfile(openfile , dtype = record_dtype , count = number_of_records )
time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()
savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, fid=time_series)
end_time = clock()
print 'It took',end_time - start_time,'seconds'
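For reference, a minimal sketch of reading the saved archive back in (this assumes output_file already ends in ".npz"; otherwise savez appends that extension and the load path would need it too):

from numpy import load

archive = load(output_file)      # opens the .npz archive
t_series = archive['t']          # keys match the keywords passed to savez
time_series = archive['fid']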
Answer 0 (score: 15)
Some hints:
Don't use the struct module. Instead, use Numpy's structured data types and fromfile. See here: http://scipy-lectures.github.com/advanced/advanced_numpy/index.html#example-reading-wav-files
You can read all of the records at once, by passing a suitable count= to fromfile.
Something like this (untested, but you get the idea):

import numpy as np

file = open(input_file, 'rb')
header = file.read(149)

# ... parse the header as you did ...

record_dtype = np.dtype([
    ('timestamp', '<i4'),
    ('samples', '<i2', (sample_rate, 4))
])

data = np.fromfile(file, dtype=record_dtype, count=number_of_records)
# NB: count can be omitted -- it just reads the whole file then

time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()
Answer 1 (score: 2)
Numpy supports mapping binary data from a file directly into array-like objects via numpy.memmap. You may be able to memmap the file and extract the data you need via offsets.
For byte-order correctness, just use numpy.byteswap on what you have read in. You can use a conditional expression to check the endianness of the host system:
if struct.pack('=f', np.pi) == struct.pack('>f', np.pi):
    # Host is big-endian, in-place conversion
    arrayName.byteswap(True)
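As an untested sketch of the memmap approach, reusing the sample_rate and number_of_records values parsed from the header as in the question (those names are assumed here, not part of this answer):

import numpy as np

# Map the records straight from disk, skipping the 149-byte ASCII header
# via the offset argument; nothing is read until the data is accessed.
record_dtype = np.dtype([('timestamp', '<i4'),
                         ('samples', '<i2', (sample_rate, 4))])
data = np.memmap(input_file, dtype=record_dtype, mode='r',
                 offset=149, shape=(number_of_records,))
time_series = np.array(data['timestamp'])               # copy out of the memmap
t_series = np.array(data['samples'][:, :, 0]).ravel()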
Answer 2 (score: 2)
One glaring inefficiency is the use of hstack in a loop:
time_series = hstack ( ( time_series , time_stamp ) )
t_series = hstack ( ( t_series , record_t ) )
x_series = hstack ( ( x_series , record_x ) )
y_series = hstack ( ( y_series , record_y ) )
z_series = hstack ( ( z_series , record_z ) )
On every iteration, this allocates a slightly bigger array for each series and copies all the data read so far into it. This involves lots of needless copying and can potentially lead to bad memory fragmentation.
I would accumulate the values of time_stamp in a list and do one hstack at the end, and do exactly the same for record_t etc.
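A minimal sketch of that accumulation pattern, keeping the variable names from the question (the unpacking inside the loop is omitted and assumed to be the same as before):

time_chunks = []
t_chunks = []
for record in xrange(number_of_records):
    # ... unpack time_stamp and record_t exactly as in the question ...
    time_chunks.append(time_stamp)
    t_chunks.append(record_t)
time_series = hstack(time_chunks)   # one concatenation at the very end
t_series = hstack(t_chunks)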
If that doesn't bring a sufficient performance improvement, I would comment out the body of the loop and start bringing things back one at a time, to see where exactly the time is spent.
Answer 3 (score: 0)
Using array and struct.unpack, I got satisfactory results with a similar problem (multi-resolution, multi-channel binary data files). In my problem I wanted continuous data for each channel, but the file had an interval-oriented structure rather than a channel-oriented one.
The "secret" is to read the whole file first, and only then distribute the known-size slices into the desired containers (in the code below, self.channel_content[channel]['recording'] is an object of type array):
f = open(somefilename, 'rb')
fullsamples = array('h')
fullsamples.fromfile(f, os.path.getsize(somefilename)/2 - f.tell())
position = 0

for rec in xrange(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(fullsamples[position:position+samples])
        position += samples
Of course, I can't say this is better or faster than the other answers provided, but at least it is something you might evaluate.
Hope it helps!