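My problem is extracting data from a text file that contains both numbers and characters. The data I want (the numbers) is separated by lines of characters that describe the data set that follows. The text file is quite large (> 2,000,000 lines).
I am trying to put each data set (the numeric rows between two lines containing characters) into a matrix. Each matrix should be named according to the description (the frequency) in the text line above its data set. I have working code, but I am running into performance problems. Maybe someone can help me speed it up. One file currently takes about 15 minutes. I need the numbers in matrices so that I can process them further.
Snippet of the text file: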
21603 2135 21339 21604
103791 94 1 1 1 4
21339 1702 21600 21604
-1
-1
2414
1
Velocity (magnitude) Response at Structural FE Nodes
1
Frequency = 10.00 Hz
Result = Engineering Units
Component = Vmag
Location =
Form & Units = RMS Magnitude in m/s
1 5 1 11 2 1
1 0 1 1 1 0 0 0
1 2161
0.00000e+000 1.00000e+001 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000
0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000
20008
1.23285e-004
20428
1.21613e-004
Here is my code:
file='large_file.txt';
fid=fopen(file,'r');
k=1;
filerows=2164986; % nr of rows in textfile
A=zeros(filerows,6); % preallocate Matrix where textfile should be saved in
for count=1:8 % get rid of first 8 lines
    fgets(fid);
end
name=0;
start=1;
while ~feof(fid)
    a=fgets(fid);
    b=str2double(strread(a,'%s')); % turn read row in a vector
    if isnan(b(1))==1 % check whether there are characters in the row
        if strfind(a,'Frequency') % check if 'Frequency' is in the row
            Matrixname = sprintf('Frequency%i=A(%i:%i,:);',name,start,k);
            eval(Matrixname);
            name=b(3);
            for count=1:10 % get rid of next 10 lines
                fgets(fid);
            end
            start=k+1;
        end
    else % if there are just numbers in the row, insert it into the matrix
        A(k,1:length(b))=b; % populate matrix A with the row entries
        k = k+1;
    end
    k/filerows % show progress
end
fclose(fid);
Matrixname = sprintf('Frequency%i=A(%i:end,:);',name,start);
eval(Matrixname);
Answer 0 (score: 0)
The MATLAB profiler can show you which lines of code take the most time, so you can work out what to optimize.
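As a minimal sketch of how the profiler might be used here, the loop below is only a stand-in for the file-parsing loop from the question (an assumption for illustration); `profile on`, `profile off`, `profile('info')` and `profile viewer` are the standard profiler commands.

```matlab
profile on                         % start collecting timing data

total = 0;
for n = 1:1e5
    total = total + sqrt(n);       % stand-in for the real parsing work
end

profile off                        % stop collecting
stats = profile('info');           % struct holding the collected results

% profile viewer                   % opens the interactive report (needs the desktop)
```

In the report, sorting by self time usually points straight at the expensive lines.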
As the original poster found, the line causing the trouble in this case is
k/filerows % show progress
Printing to the screen that many times is very time-consuming. If you want to show progress without slowing the code down, you can do
if mod(k, floor(filerows/100)) == 0 % floor keeps the divisor an integer
    fprintf('%d rows processed\n', k); % print the actual row count
end
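This code updates the display about 100 times, or roughly once every 3.5 seconds in this particular case.
If you want to get really fancy, look into waitbar, but that is usually overkill.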
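Answer 1 (score: 0)
In the end I went with the sscanf solution. I used that function to replace str2double to gain speed, as suggested in "Why is str2double so slow in matlab as compared to a mex-function?". Sadly, it did not do much, but at least it helped.
So, at the start: ca. 850 s
Profiler after removing the progress status: ca. 450 s
Profiler after replacing str2double with sscanf: ca. 330 s
The code is now: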
file='test.txt';
fid=fopen(file,'r');
k=1;
filerows=2164986; % nr of rows in textfile
A=zeros(filerows,6); % preallocate Matrix where textfile should be saved in
for count=1:8 % get rid of first 8 lines
    fgets(fid);
end
name=0;
start=1;
while ~feof(fid)
    a=fgets(fid);
    b=strread(a,'%s');
    b=sscanf(sprintf('%s#', b{:}), '%g#')';
    if isempty(b) % check whether there had been characters in the row
        if strfind(a,'Frequency') % check whether 'Frequency' was in the row
            Matrixname = sprintf('Frequency%i=A(%i:%i,:);',name,start,k);
            eval(Matrixname);
            b=str2double(strread(a,'%s'));
            name=b(3);
            for count=1:8 % get rid of next 8 lines
                fgets(fid);
            end
            start=k+1;
        end
    else % if there were just numbers in the row, insert it into the matrix
        A(k,1:length(b))=b; % populate matrix A with the row entries
        k = k+1;
    end
end
fclose(fid);
Matrixname = sprintf('Frequency%i=A(%i:%i,:);',name,start,k);
eval(Matrixname);
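One further simplification that might be worth profiling: sscanf can parse the raw line directly, so the strread/sprintf round trip may not be needed at all. The sketch below is untested against the real file and assumes that a row counts as numeric only when sscanf consumes it without reporting an error; the variable names are illustrative.

```matlab
% A numeric row: sscanf reads every value and reports no error.
a = '21603 2135 21339 21604';
[b, count, errmsg] = sscanf(a, '%g');
isNumericRow = count > 0 && isempty(errmsg);

% A text row: nothing is read and errmsg is set.
t = 'Frequency = 10.00 Hz';
[~, countT, errT] = sscanf(t, '%g');
isTextRow = countT == 0 || ~isempty(errT);

% b comes back as a column vector, so transpose before storing it in a row:
% A(k, 1:count) = b.';
```

Checking errmsg as well as count guards against rows that start with a number but contain text further along.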
Answer 2 (score: 0)
fid = fopen(file); % file: path to the text file
data = fread(fid,[1 Inf],'char=>char'); % read the whole file as one row of characters
fclose(fid);
blockIndices = strfind(data,'Velocity'); % Calculate offsets based on data format
% Another approach that can be much faster than for loops
lineData = regexp(data,sprintf('\n'),'split'); % Now each line is in a cell
processedData = cellfun(@processData,lineData,'UniformOutput',false);
function y = processData(x)
    % do something with x
    y = x; % placeholder so the function returns a value
end
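Once I have the block indices, I can calculate the offsets to the parts of the data I want. I don't think 2 million lines is that much data. Most computers these days have several gigabytes of memory, and it looks like each line is no more than a few hundred characters, so the file is probably less than half a GB. Good luck.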
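To make the idea concrete, here is a rough sketch of the read-everything-then-split approach. It runs on a small in-memory sample instead of the real file, and it stores each block in a struct array rather than creating variables with eval. The sample text, the field names `frequency` and `data`, and the rule "a row is data if sscanf reads it cleanly" are all assumptions for illustration; the real file has extra numeric header rows after each Frequency line that would still need to be skipped.

```matlab
% Small in-memory sample standing in for the contents of the real file.
txt = sprintf('%s\n', ...
    'Frequency = 10.00 Hz', ...
    'Result = Engineering Units', ...
    '20008', ...
    '1.23285e-004', ...
    'Frequency = 20.00 Hz', ...
    '20428 20429', ...
    '1.21613e-004');

rows = regexp(txt, '\r?\n', 'split');          % one cell per line
rows = rows(~cellfun(@isempty, rows));         % drop empty lines
headerIdx = find(~cellfun(@isempty, strfind(rows, 'Frequency')));

blocks = struct('frequency', {}, 'data', {});  % empty struct array
for n = 1:numel(headerIdx)
    first = headerIdx(n) + 1;                  % first row after this header
    if n < numel(headerIdx)
        last = headerIdx(n + 1) - 1;           % row before the next header
    else
        last = numel(rows);
    end
    % Pull the value after '=' out of the header line.
    tok = regexp(rows{headerIdx(n)}, '=\s*(\S+)', 'tokens', 'once');

    data = zeros(last - first + 1, 6);         % preallocate, zero-padded to 6 columns
    k = 0;
    for r = first:last
        [v, count, errmsg] = sscanf(rows{r}, '%g');
        if count > 0 && count <= 6 && isempty(errmsg)
            k = k + 1;
            data(k, 1:count) = v.';            % keep purely numeric rows only
        end
    end
    blocks(n).frequency = str2double(tok{1});
    blocks(n).data = data(1:k, :);             % trim unused rows
end
```

With this layout, `blocks(1).data` holds the rows for the first frequency, and the blocks can be looped over without knowing the variable names in advance.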