Question

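My MATLAB program is reading a file about 7 million lines long and wasting far too much time on I/O. I know that each line is formatted as two integers, but I don't know exactly how many characters they take up. str2num is deathly slow; what MATLAB function should I be using instead?

Catch: I have to operate on one line at a time without storing the whole file in memory, so none of the commands that read an entire matrix are on the table.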

fid = fopen('file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = str2num(tline);    
    %do stuff with nums
    tline = fgetl(fid);
end
fclose(fid);

Answer 1

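Problem statement

This is a common struggle, and nothing answers the question like a test. Here are my assumptions:

1. A well-formatted ASCII file containing two columns of numbers. No headers, no inconsistent lines, etc.
2. The method must scale to reading files that are too large to hold in memory (although my patience is limited, so my test file is only 500,000 lines).
3. The actual operation (what the OP calls "do stuff with nums") must be performed one row at a time and cannot be vectorized.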

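Discussion

With that in mind, the answers and comments seem to encourage efficiency in three areas:

1. Reading the file in larger batches.
2. Performing the string-to-number conversion more efficiently (either by batching or by using better functions).
3. Making the actual processing more efficient (which I have ruled out via rule 3 above).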

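Results

I put together a quick script to test the ingestion speed (and the consistency of the result) of several variations on these themes. The results are:

1. Initial code. 68.23 sec. 582582 check
2. Using sscanf, once per line. 27.20 sec. 582582 check
3. Using fscanf in large batches. 8.93 sec. 582582 check
4. Using textscan in large batches. 8.79 sec. 582582 check
5. Reading large batches into memory, then sscanf. 8.15 sec. 582582 check
6. Using Java single line file reader and sscanf on single lines. 63.56 sec. 582582 check
7. Using Java single item token scanner. 81.19 sec. 582582 check
8. Fully batched operations (non-compliant). 1.02 sec. 508680 check (violates rule 3)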

Summary

More than half of the original time (68 -> 27 sec) was consumed by inefficiencies in the str2num call, which can be removed by switching to sscanf.
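As a minimal sketch of why sscanf fits the question's constraint (two integers per line, unknown width), the snippet below parses two made-up lines. The variable names and sample strings are illustrative assumptions, not part of the benchmark.

```matlab
% Hypothetical sample lines; the field widths differ from line to line.
lineA = '7 123456';
lineB = '98765, 4';

% '%d' skips leading whitespace, so integers of any width are read
% into a 2x1 column vector without knowing their character count.
numsA = sscanf(lineA, '%d %d');

% A literal comma in the format string matches the comma in the input.
numsB = sscanf(lineB, '%d, %d');

disp(numsA.');
disp(numsB.');
```

Because the format string is fixed, sscanf avoids the general expression evaluation that str2num performs on every call.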

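Roughly another 2/3 of the remaining time (27 -> 8 sec) can be saved by using larger batches for both file reading and string-to-number conversion.

If we are willing to violate rule 3 in the original post, another 7/8 of the time can be saved by switching to fully numeric processing. However, some algorithms do not lend themselves to this, so we leave it alone. (Note that the "check" value for the last entry does not match the others.)

Finally, in direct contradiction to a previous edit of mine within this response, no savings are available by switching to the available cached Java single-line readers. In fact, that solution is 2-3 times slower than the comparable single-line result using native readers (63 vs. 27 sec).

Sample code for all of the solutions described above is included below.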

Sample code

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Create a test file
cd(tempdir);
fName = 'demo_file.txt';
fid = fopen(fName,'w');
for ixLoop = 1:5
    d = randi(1e6, 1e5,2);
    fprintf(fid, '%d, %d \n',d);
end
fclose(fid);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Initial code
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = str2num(tline);
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Initial code.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using sscanf, once per line
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = sscanf(tline,'%d, %d');
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Using sscanf, once per line.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using fscanf in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
while ~isempty(scannedData)
    for ix = 1:size(scannedData,1)
        nums = scannedData(ix,:);
        CHECK = round((CHECK + mean(nums) ) /2);
    end
    scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
end
fclose(fid);
t = toc;
fprintf(1,'Using fscanf in large batches.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using textscan in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
while ~isempty(scannedData{1})
    for ix = 1:size(scannedData{1},1)
        nums = [scannedData{1}(ix) scannedData{2}(ix)];
        CHECK = round((CHECK + mean(nums) ) /2);
    end
    scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
end
fclose(fid);
t = toc;
fprintf(1,'Using textscan in large batches.  %3.2f sec.  %d check \n', t, CHECK);



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, incrementing to end-of-line, sscanf
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');

dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
    dataIncrement(end+1) = fread(fid,1,'uint8=>char');  %This can be slightly optimized
end
data = [dataBatch dataIncrement];

while ~isempty(data)
    scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
    for ix = 1:size(scannedData,1)
        nums = scannedData(ix,:);
        CHECK = round((CHECK + mean(nums) ) /2);
    end

    dataBatch = fread(fid,bufferSize,'uint8=>char')';
    dataIncrement = fread(fid,1,'uint8=>char');
    while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
        dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
    end
    data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Reading large batches into memory, then sscanf.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java single line readers + sscanf
CHECK = 0;
tic;
bufferSize = 1e4;
reader =  java.io.LineNumberReader(java.io.FileReader('demo_file.txt'),bufferSize );
tline = char(reader.readLine());
while ~isempty(tline)
    nums = sscanf(tline,'%d, %d');
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = char(reader.readLine());
end
reader.close();
t = toc;
fprintf(1,'Using java single line file reader and sscanf on single lines.  %3.2f sec.  %d check \n', t, CHECK);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java scanner for file reading and string conversion
CHECK = 0;
tic;
jFile = java.io.File('demo_file.txt');
scanner = java.util.Scanner(jFile);
scanner.useDelimiter('[\s\,\n\r]+');
while scanner.hasNextInt()
    nums = [scanner.nextInt() scanner.nextInt()];
    CHECK = round((CHECK + mean(nums) ) /2);
end
scanner.close();
t = toc;
fprintf(1,'Using java single item token scanner.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, vectorized operations (non-compliant solution)
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');

dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
    dataIncrement(end+1) = fread(fid,1,'uint8=>char');  %This can be slightly optimized
end
data = [dataBatch dataIncrement];

while ~isempty(data)
    scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
    CHECK = round((CHECK + mean(scannedData(:)) ) /2);

    dataBatch = fread(fid,bufferSize,'uint8=>char')';
    dataIncrement = fread(fid,1,'uint8=>char');
    while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
        dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
    end
    data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Fully batched operations.  %3.2f sec.  %d check \n', t, CHECK);

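(Original answer)

To expand on the point made by Ben: if you read these files line by line, your bottleneck will always be file I/O.

I understand that sometimes you cannot fit a whole file into memory. I typically read in a large batch of characters (1e5, 1e6 or thereabouts, depending on the memory of your system). Then I either read additional single characters (or back off single characters) to reach a whole number of lines, and then run the string parsing (e.g. sscanf).

Then, if you want, you can process the resulting large matrix one row at a time, before repeating the whole process until you reach the end of the file.

It's a little tedious, but not that hard. I typically see a speed improvement of 90% or more over single-line readers.

(Terrible idea of using Java batched line readers removed.)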
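The "back off" variant described above can be sketched as follows. This is only an illustration under stated assumptions: the file name, chunkSize, and the running total are hypothetical stand-ins, every line is assumed to be shorter than chunkSize, and lines are assumed to end in a plain '\n'.

```matlab
% Build a small hypothetical test file with 1000 lines of "k, k+1000".
fName = fullfile(tempdir, 'chunk_demo.txt');
fid = fopen(fName, 'w');
fprintf(fid, '%d, %d\n', [1:1000; 1001:2000]);
fclose(fid);

chunkSize = 4096;   % characters per read; 1e5 or 1e6 in practice
nRows = 0;
total = 0;

fid = fopen(fName, 'r');
while true
    chunk = fread(fid, chunkSize, 'uint8=>char')';
    if isempty(chunk)
        break;                              % end of file reached
    end
    lastEol = find(chunk == char(10), 1, 'last');
    if ~isempty(lastEol) && lastEol < numel(chunk)
        % Back off: rewind over the partial trailing line so the next
        % read starts at the beginning of that line.
        fseek(fid, lastEol - numel(chunk), 'cof');
        chunk = chunk(1:lastEol);
    end
    % Convert the whole chunk at once, then walk it one row at a time.
    batch = reshape(sscanf(chunk, '%d, %d'), 2, [])';
    for ix = 1:size(batch, 1)
        nums = batch(ix, :);                % one line's two integers
        total = total + sum(nums);          % stand-in for the real work
    end
    nRows = nRows + size(batch, 1);
end
fclose(fid);
fprintf('Processed %d rows\n', nRows);
```

Rewinding with fseek replaces the character-by-character read-ahead with a single seek, at the cost of rereading the partial line on the next pass.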

Answer 2

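Even if you can't fit the whole file in memory, you should read large batches using the matrix read functions.

Maybe you can even use vector operations for some of the data processing, which would speed things up further.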
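A rough sketch of that idea, assuming comma-separated integer pairs: fscanf accepts a size argument, so each call can return a bounded batch that is then processed with vector operations. The file name, rowsPerBatch, and the column-sum computation are illustrative assumptions.

```matlab
% Build a small hypothetical test file with 10 lines of "k, k+10".
fName = fullfile(tempdir, 'batch_demo.txt');
fid = fopen(fName, 'w');
fprintf(fid, '%d, %d\n', [1:10; 11:20]);
fclose(fid);

rowsPerBatch = 4;   % tiny for illustration; 1e4 or more in practice
colSums = [0 0];

fid = fopen(fName, 'r');
% Size [2 N] fills the result column by column; transpose to get rows.
batch = fscanf(fid, '%d, %d', [2 rowsPerBatch])';
while ~isempty(batch)
    colSums = colSums + sum(batch, 1);   % vectorized over the batch
    batch = fscanf(fid, '%d, %d', [2 rowsPerBatch])';
end
fclose(fid);
disp(colSums);
```

Only one batch is held in memory at a time, so peak memory is governed by rowsPerBatch rather than by the file size.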

Answer 3

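I have had good results (speed-wise) using memmapfile(). This minimises the amount of in-memory data copying and makes use of the kernel's IO buffering. You need enough free address space (though not actual free memory) to map the entire file, and enough free memory to hold the output variable (obviously!).

The example code below reads a text file into a two-column matrix data of type int32.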

fname = 'file.txt';
fstats = dir(fname);
% Map the file as one long character string
m = memmapfile(fname, 'Format', {'uint8' [ 1 fstats.bytes] 'asUint8'});
textdata = char(m.Data(1).asUint8);
% Use textscan() to parse the string and convert to an int32 matrix
data = textscan(textdata, '%d %d', 'CollectOutput', 1);
data = data{:};
% Tidy up!
clear('m')

You may need to tweak the parameters to textscan() to get exactly what you want; see the online docs.
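As one example of such tweaking, here is a sketch that assumes the fields are separated by commas rather than spaces. The textdata string is a hypothetical stand-in for the mapped file contents.

```matlab
% Hypothetical in-memory text standing in for the mapped file contents.
textdata = sprintf('1, 10\n2, 20\n3, 30\n');

% 'Delimiter' makes textscan treat commas as field separators;
% '%d' yields int32, and 'CollectOutput' merges both columns
% into a single matrix.
data = textscan(textdata, '%d %d', 'Delimiter', ',', 'CollectOutput', 1);
data = data{1};

disp(class(data));
disp(size(data));
```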

Answer 4

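I have found that MATLAB reads csv files significantly faster than text files, so if you can convert your text file to csv using some other software, that may speed up the MATLAB side considerably.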
