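My MATLAB program reads a file that is about 7 million lines long and wastes far too much time on I/O. I know that each line is formatted as two integers, but I don't know exactly how many characters they take up. str2num is deathly slow; what MATLAB function should I be using instead?
The catch: I have to operate on one line at a time without storing the whole file in memory, so none of the commands that read entire matrices are on the table.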
fid = fopen('file.txt');
tline = fgetl(fid);
while ischar(tline)
nums = str2num(tline);
%do stuff with nums
tline = fgetl(fid);
end
fclose(fid);
Answer 0 (score: 60)
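This is a common struggle, and nothing answers the question like a test. Here are my assumptions:
1. A well-formatted ASCII file containing two columns of numbers. No headers, no inconsistent lines, etc.
2. The method must scale to reading files that are too large to fit in memory (although my patience is limited, so my test file is only 500,000 lines).
3. The actual operation (what the OP calls "do stuff with nums") must be performed one row at a time and cannot be vectorized.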
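With that in mind, the answers and comments seem to encourage efficiency in three areas: reading the file in larger batches, converting strings to numbers more efficiently (via batching or better functions), and making the actual processing more efficient (which the third assumption above rules out).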
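I put together a quick script to test the ingestion speed (and the consistency of the results) of 6 variations on these themes. The results are:
More than half of the original time (68 -> 27 sec) was consumed by inefficiencies in the str2num call, which can be removed by switching to sscanf.
About another 2/3 of the remaining time (27 -> 8 sec) can be cut by using larger batches for both the file reading and the string-to-number conversion.
If we are willing to violate the third assumption above, another 7/8 of the time can be cut by switching to fully numeric processing. However, some algorithms are not amenable to this, so we leave it alone. (Note that the "check" value does not match for the last entry.)
Finally, in direct contradiction to a previous edit of this response, no savings are available from switching to the cached Java single-line readers. In fact, that solution is 2-3 times slower than the comparable single-line result using native readers (63 vs. 27 sec).
Sample code for all of the solutions described above is included below.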
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Create a test file
cd(tempdir);
fName = 'demo_file.txt';
fid = fopen(fName,'w');
for ixLoop = 1:5
d = randi(1e6, 1e5,2);
fprintf(fid, '%d, %d \n',d);
end
fclose(fid);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Initial code
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
nums = str2num(tline);
CHECK = round((CHECK + mean(nums) ) /2);
tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Initial code. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using sscanf, once per line
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
nums = sscanf(tline,'%d, %d');
CHECK = round((CHECK + mean(nums) ) /2);
tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Using sscanf, once per line. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using fscanf in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
while ~isempty(scannedData)
for ix = 1:size(scannedData,1)
nums = scannedData(ix,:);
CHECK = round((CHECK + mean(nums) ) /2);
end
scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
end
fclose(fid);
t = toc;
fprintf(1,'Using fscanf in large batches. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using textscan in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
while ~isempty(scannedData{1})
for ix = 1:size(scannedData{1},1)
nums = [scannedData{1}(ix) scannedData{2}(ix)];
CHECK = round((CHECK + mean(nums) ) /2);
end
scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
end
fclose(fid);
t = toc;
fprintf(1,'Using textscan in large batches. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, incrementing to end-of-line, sscanf
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');
dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
dataIncrement(end+1) = fread(fid,1,'uint8=>char'); %This can be slightly optimized
end
data = [dataBatch dataIncrement];
while ~isempty(data)
scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
for ix = 1:size(scannedData,1)
nums = scannedData(ix,:);
CHECK = round((CHECK + mean(nums) ) /2);
end
dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
end
data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Reading large batches into memory, then sscanf. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java single line readers + sscanf
CHECK = 0;
tic;
bufferSize = 1e4;
reader = java.io.LineNumberReader(java.io.FileReader('demo_file.txt'),bufferSize );
tline = char(reader.readLine());
while ~isempty(tline)
nums = sscanf(tline,'%d, %d');
CHECK = round((CHECK + mean(nums) ) /2);
tline = char(reader.readLine());
end
reader.close();
t = toc;
fprintf(1,'Using java single line file reader and sscanf on single lines. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java scanner for file reading and string conversion
CHECK = 0;
tic;
jFile = java.io.File('demo_file.txt');
scanner = java.util.Scanner(jFile);
scanner.useDelimiter('[\s\,\n\r]+');
while scanner.hasNextInt()
nums = [scanner.nextInt() scanner.nextInt()];
CHECK = round((CHECK + mean(nums) ) /2);
end
scanner.close();
t = toc;
fprintf(1,'Using java single item token scanner. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, vectorized operations (non-compliant solution)
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');
dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
dataIncrement(end+1) = fread(fid,1,'uint8=>char'); %This can be slightly optimized
end
data = [dataBatch dataIncrement];
while ~isempty(data)
scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
CHECK = round((CHECK + mean(scannedData(:)) ) /2);
dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
end
data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Fully batched operations. %3.2f sec. %d check \n', t, CHECK);
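The str2num versus sscanf difference reported above can also be checked in isolation. The snippet below is a minimal sketch, not part of the original benchmark: it parses one sample line both ways, confirms the results agree, and times each call with timeit. str2num evaluates its input as MATLAB code through eval, which is a likely source of its overhead; sscanf parses against a fixed format. The sample line is made up for illustration, and absolute timings will differ between machines and MATLAB releases.

```matlab
% Micro-benchmark sketch: str2num vs. sscanf on a single line of text.
tline = '123456, 654321';            % hypothetical line in the test-file format

a = str2num(tline);                  % evaluates the text as a MATLAB expression
b = sscanf(tline, '%d, %d')';        % format-driven parse; transpose to a row vector

assert(isequal(a, b));               % both parsers must return the same two values

% timeit calls each function handle repeatedly and reports a typical time per call
tStr2num = timeit(@() str2num(tline));
tSscanf  = timeit(@() sscanf(tline, '%d, %d'));
fprintf('str2num: %.2e s/call, sscanf: %.2e s/call\n', tStr2num, tSscanf);
```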
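(Original answer)
To expand on Ben's point: if you read these files line by line, your bottleneck will always be the file I/O.
I understand that sometimes you cannot fit the whole file into memory. What I usually do is read in a large batch of characters (1e5, 1e6 or thereabouts, depending on the memory of your system). Then I either read additional single characters or back off single characters to get a whole number of lines, and then run the string parsing (e.g. sscanf).
Then, if you want, you can process the resulting large matrix one row at a time, before repeating the whole process until you reach the end of the file.
It's a little tedious, but not that hard. I typically see a speed improvement of more than 90% over single-line readers.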
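The "back off" variant of this idea can be sketched as follows. This is an illustrative sketch rather than the code that was benchmarked: it reads a fixed-size block of characters, finds the last newline in the block, and uses fseek to rewind the file position so the partial line is re-read in the next block. The file name, the tiny bufferSize (chosen so the back-off path is exercised on a four-line file), and the running total are all assumptions made for the example; in practice bufferSize would be 1e5 or larger.

```matlab
% Create a small throwaway file so the sketch runs on its own.
fName = [tempname '.txt'];
d = [10 200; 3000 4; 55 66666; 7 8];
fid = fopen(fName, 'w');
fprintf(fid, '%d, %d \n', d');       % transpose so each row of d becomes one line
fclose(fid);

bufferSize = 16;                     % characters per block (tiny, for demonstration)
eol = sprintf('\n');
total = 0;                           % stand-in for the real per-line work

fid = fopen(fName, 'r');
while true
    chunk = fread(fid, bufferSize, 'uint8=>char')';
    if isempty(chunk), break; end
    if ~feof(fid)
        % The block may end mid-line: keep text up to the last newline only
        lastEol = find(chunk == eol, 1, 'last');
        if isempty(lastEol)
            error('A line is longer than bufferSize; increase bufferSize.');
        end
        fseek(fid, lastEol - numel(chunk), 'cof');  % rewind over the partial line
        chunk = chunk(1:lastEol);
    end
    rows = reshape(sscanf(chunk, '%d, %d'), 2, [])';   % one row per line
    for ix = 1:size(rows, 1)
        nums = rows(ix, :);          % process a single line's values here
        total = total + sum(nums);
    end
end
fclose(fid);
delete(fName);
```

Backing off with a single fseek avoids the character-by-character read loop, at the cost of re-reading a few bytes per block.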
(A terrible idea using Java batched line readers has been removed.)
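Answer 1 (score: 3)
Even if you can't fit the whole file in memory, you should read a large batch at a time using the matrix read functions.
You may even be able to use vector operations for some of the data processing, which would speed things up further.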
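As a rough illustration of this suggestion, the sketch below reads a bounded number of lines per call with fscanf and applies a vectorized operation (a column sum) to each batch, so memory use stays bounded by the batch size. The file contents, batchLines, and the column-sum operation are assumptions picked for the example; this only applies when the per-line work can actually be expressed as a vector operation.

```matlab
% Create a throwaway two-column file so the sketch runs on its own.
fName = [tempname '.txt'];
d = randi(1e6, 1000, 2);
fid = fopen(fName, 'w');
fprintf(fid, '%d, %d \n', d');       % transpose so each row of d becomes one line
fclose(fid);

batchLines = 256;                    % lines per batch; tune to available memory
colSum = [0 0];                      % running result of the vectorized operation

fid = fopen(fName, 'r');
while true
    % Read at most batchLines lines as a 2-by-N block, then transpose to N-by-2
    batch = fscanf(fid, '%d, %d', [2 batchLines])';
    if isempty(batch), break; end
    colSum = colSum + sum(batch, 1); % one vectorized call per batch
end
fclose(fid);
delete(fName);
```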
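Answer 2 (score: 3)
I have had good results (speed-wise) using memmapfile(). This minimizes the amount of in-memory data copying and makes use of the kernel's I/O buffering. You need enough free address space (though not actual free memory) to map the entire file, and enough free memory to hold the output variable (obviously!).
The example code below reads a text file into a two-column matrix data of int32 type.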
fname = 'file.txt';
fstats = dir(fname);
% Map the file as one long character string
m = memmapfile(fname, 'Format', {'uint8' [ 1 fstats.bytes] 'asUint8'});
textdata = char(m.Data(1).asUint8);
% Use textscan() to parse the string and convert to an int32 matrix
data = textscan(textdata, '%d %d', 'CollectOutput', 1);
data = data{:};
% Tidy up!
clear('m')
You may need to tweak the parameters to textscan() to get exactly what you want; see the online docs.
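One such tweak, shown here as a hedged sketch: if the two integers are separated by a comma (as in the test file generated in Answer 0), passing the 'Delimiter' option lets the same '%d %d' format parse the text. The sample string is made up for the example and stands in for the memory-mapped textdata.

```matlab
% Stand-in for the mapped file contents: two comma-separated integers per line
textdata = sprintf('10, 200 \n3000, 4 \n');

% 'Delimiter' tells textscan to treat commas as field separators
data = textscan(textdata, '%d %d', 'Delimiter', ',', 'CollectOutput', 1);
data = data{1};                      % N-by-2 int32 matrix
```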
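Answer 3 (score: 1)
I found that MATLAB reads csv files significantly faster than text files, so if you can convert your text file to csv using some other software, this may significantly speed up MATLAB's operations.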