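I'm parsing a large text file full of data and saving it to disk as a *.mat file so that I can easily load in only parts of it (see here for details on reading the file, and here for the data). To do this, I read one line at a time, parse the line, and then append it to the file. The problem is that the file itself is 3 orders of magnitude larger than the data contained in it!
Here is a stripped-down version of my code: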
database = which('01_hit12.par');
[directory,filename,~] = fileparts(database);
% matfile creates a v7.3 MAT-file that can be written to piece by piece
matObj = matfile(fullfile(directory,[filename '.mat']),'Writable',true);
fidr = fopen(database);
hitranTemp = fgetl(fidr);
k = 1;
while ischar(hitranTemp)            % fgetl returns -1 at end of file
    if abs(hitranTemp(1)) == 32     % leading blank: replace it with '0'
        hitranTemp(1) = '0';
    end
    hitran = textscan(hitranTemp,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6u%2u%2u%2u%2u%2u%2u%1c%7f%7f','delimiter','','whitespace','');
    % one element per variable is written to disk on every iteration
    matObj.moleculeNumber(1,k) = uint8(hitran{1});
    matObj.isotopeologueNumber(1,k) = uint8(hitran{2});
    matObj.vacuumWavenumber(1,k) = hitran{3};
    matObj.lineIntensity(1,k) = hitran{4};
    matObj.airWidth(1,k) = single(hitran{6});
    matObj.selfWidth(1,k) = single(hitran{7});
    matObj.lowStateE(1,k) = single(hitran{8});
    matObj.tempDependWidth(1,k) = single(hitran{9});
    matObj.pressureShift(1,k) = single(hitran{10});
    if rem(k,1e4) == 0
        % K (total number of lines) is not defined in this stripped-down version
        display(sprintf('line %u (%2.2f)',k,100*k/K));
    end
    hitranTemp = fgetl(fidr);
    k = k + 1;
end
fclose(fidr);
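I stopped the code after it had gotten through 13,813 of the 224,515 lines, because it had been taking a very long time and the file was getting huge, although the last printout indicated I had only just cleared 10k lines. I cleared the memory and then ran: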
S = whos('-file','01_hit12.mat');
fileBytes = sum([S.bytes]);
T = dir(which('01_hit12.mat'));
diskBytes = T.bytes;
disp([fileBytes diskBytes diskBytes/fileBytes])
and got the output:
524894 896189009 1707.37141022759
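What is taking up the extra 895,664,115 bytes? I know the help page says there should be a little extra overhead, but I feel that nearly a GB of descriptive header is a bit excessive!
New information:
I tried pre-allocating the file, thinking that MATLAB might be doing the same thing it does when a matrix is grown inside a loop, that is, reallocating a chunk of disk space for the whole matrix on every write. That isn't it. Filling the file with zeros of the appropriate data types produces a file for which my short check script returns: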
8531570 71467 0.00837677004349727
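This makes more sense to me. MATLAB is saving the file sparsely, so the file size on disk is much smaller than the size of the full matrices in memory. Once it starts replacing the zeros with real data, however, I get the same behavior as before, and the file size starts to balloon beyond all reasonable bounds.
New new information:
I tried this on a subset of the data, 100 lines long. To stream to disk the data has to be in v7.3 format, so I ran the subset through my script, loaded it into memory, and then resaved it in v7.0 format. Here are the results (bytes reported by whos, bytes on disk, and their ratio):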
v7.3: 3800 8752 2.30
v7.0: 3800 2561 0.67
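The re-save step described above can be sketched roughly as follows. This is only an illustration under assumptions: the file names are hypothetical, and random values stand in for the 100-line subset. It relies on `matfile` creating a v7.3 file when the target does not exist yet, and on `save` accepting the `-v7` flag.

```matlab
% Hypothetical file names in the temp directory (not from the original post)
srcFile = fullfile(tempdir, 'subset_v73.mat');
dstFile = fullfile(tempdir, 'subset_v7.mat');
if exist(srcFile, 'file'), delete(srcFile); end

% Stand-in for the streamed subset: a new file created through matfile is v7.3
m = matfile(srcFile, 'Writable', true);
m.vacuumWavenumber = rand(1, 100);          % placeholder data, double
m.airWidth         = rand(1, 100, 'single'); % placeholder data, single

% Load everything into a struct, then write its fields back as v7 variables
vars = load(srcFile);
save(dstFile, '-struct', 'vars', '-v7');

% Compare the sizes on disk
s73 = dir(srcFile);
s7  = dir(dstFile);
fprintf('v7.3: %d bytes on disk\nv7:   %d bytes on disk\n', s73.bytes, s7.bytes);
```

This only works when the whole data set fits in memory, which is exactly the situation streaming to a v7.3 file is meant to avoid, so it is a diagnostic rather than a fix.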
No wonder v7.3 isn't the default format. Does anyone know a way around this? Is this a bug or a feature?
Answer 0 (score: 2)
This seems like a bug to me. A workaround is to write in chunks to pre-allocated arrays.
Start by pre-allocating:
fid = fopen('01_hit12.par', 'r');
data = fread(fid, inf, 'uint8');
nlines = nnz(data == 10) + 1;
fclose(fid);
matObj.moleculeNumber = zeros(1,nlines,'uint8');
matObj.isotopeologueNumber = zeros(1,nlines,'uint8');
matObj.vacuumWavenumber = zeros(1,nlines,'double');
matObj.lineIntensity = zeros(1,nlines,'double');
matObj.airWidth = zeros(1,nlines,'single');
matObj.selfWidth = zeros(1,nlines,'single');
matObj.lowStateE = zeros(1,nlines,'single');
matObj.tempDependWidth = zeros(1,nlines,'single');
matObj.pressureShift = zeros(1,nlines,'single');
Then, to write in chunks of 10000, I modified your code as follows:
... % your code plus pre-alloc first
bs = 10000;
fmt = '%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6u%2u%2u%2u%2u%2u%2u%1c%7f%7f';
while ischar(hitranTemp)
    for ii = 1:bs
        if abs(hitranTemp(1)) == 32     % leading blank: replace it with '0' on every line
            hitranTemp(1) = '0';
        end
        hitran{ii} = textscan(hitranTemp,fmt,'delimiter','','whitespace','');
        hitranTemp = fgetl(fidr);
        if ~ischar(hitranTemp), bs = ii; break; end   % end of file: shrink the last block
    end
    % this part really ugly, sorry! trying to keep it compact...
    blk = hitran(1:bs);                 % drop stale cells left over from a longer block
    % fields 6 to 10 are used for the last five variables, as in the question (field 5 is skipped)
    matObj.moleculeNumber(1,k:k+bs-1) = uint8(cellfun(@(c)c{1},blk));
    matObj.isotopeologueNumber(1,k:k+bs-1) = uint8(cellfun(@(c)c{2},blk));
    matObj.vacuumWavenumber(1,k:k+bs-1) = cellfun(@(c)c{3},blk);
    matObj.lineIntensity(1,k:k+bs-1) = cellfun(@(c)c{4},blk);
    matObj.airWidth(1,k:k+bs-1) = single(cellfun(@(c)c{6},blk));
    matObj.selfWidth(1,k:k+bs-1) = single(cellfun(@(c)c{7},blk));
    matObj.lowStateE(1,k:k+bs-1) = single(cellfun(@(c)c{8},blk));
    matObj.tempDependWidth(1,k:k+bs-1) = single(cellfun(@(c)c{9},blk));
    matObj.pressureShift(1,k:k+bs-1) = single(cellfun(@(c)c{10},blk));
    k = k + bs;
    fprintf('.');
end
fclose(fidr);
The final size on disk is 21,393,408 bytes. The usage breaks down as,
>> S = whos('-file','01_hit12.mat');
>> fileBytes = sum([S.bytes]);
>> T = dir(which('01_hit12.mat'));
>> diskBytes = T.bytes; ratio = diskBytes/fileBytes;
>> fprintf('%10d whos\n%10d disk\n%10.6f\n',fileBytes,diskBytes,ratio)
8531608 whos
21389582 disk
2.507099
Still rather inefficient, but not out of control.
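To look at the effect in isolation, without the parsing, something like the following sketch may help. It is not part of the measurements above: the file names, the array length `n`, and the block size `bs` are made up for illustration, and random data stands in for the parsed values. It writes the same vector once block by block into a pre-allocated variable and once element by element, then prints both sizes on disk.

```matlab
n  = 2000;                       % illustrative length, far smaller than the real file
bs = 500;                        % illustrative block size
x  = rand(1, n, 'single');       % placeholder data

fBlock = fullfile(tempdir, 'demo_block.mat');   % hypothetical file names
fElem  = fullfile(tempdir, 'demo_elem.mat');
for f = {fBlock, fElem}
    if exist(f{1}, 'file'), delete(f{1}); end
end

% Block-wise: pre-allocate on disk, then one indexed write per block
mB = matfile(fBlock, 'Writable', true);
mB.x = zeros(1, n, 'single');
for k = 1:bs:n
    idx = k:min(k+bs-1, n);
    mB.x(1, idx) = x(idx);
end

% Element-wise: the variable grows by one element per write, as in the question
mE = matfile(fElem, 'Writable', true);
for k = 1:n
    mE.x(1, k) = x(k);
end

dB = dir(fBlock);
dE = dir(fElem);
fprintf('data: %d bytes\nblock-wise file: %d bytes\nelement-wise file: %d bytes\n', ...
        4*n, dB.bytes, dE.bytes);
```

How large the gap is will depend on the MATLAB version and the sizes chosen, so the numbers are best read as a qualitative check.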