Fastest Matlab file reading?

Asked: 2012-02-25 02:21:19

Tags: matlab file-io

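My MATLAB program is reading a file about 7 million lines long and wasting far too much time on I/O. I know that each line is formatted as two integers, but I don't know exactly how many characters they take up. str2num is deathly slow; which MATLAB function should I be using instead?

Catch: I have to operate on one line at a time without storing the whole file in memory, so none of the commands that read an entire matrix are on the table.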

fid = fopen('file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = str2num(tline);    
    %do stuff with nums
    tline = fgetl(fid);
end
fclose(fid);

4 answers:

Answer 0: (score: 60)

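Problem statement

This is a common struggle, and nothing answers the question like a test. Here are my assumptions:

  1. A well-formatted ASCII file containing two columns of numbers. No headers, no inconsistent lines, etc.

  2. The method must scale to reading files that are too large to fit in memory (although my patience is limited, so my test file is only 500,000 lines).

  3. The actual operation (what the OP calls "do stuff with nums") must be performed one row at a time and cannot be vectorized.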

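Discussion

With that in mind, the answers and comments seem to encourage efficiency in three areas:

  • reading the file in larger batches
  • performing the string-to-number conversion more efficiently (either by batching or by using better functions)
  • making the actual processing more efficient (which I have ruled out via rule 3 above).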

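Results

I put together a quick script to test the ingestion speed (and the consistency of the result) of eight variations on these themes. The results are:

  • Initial code. 68.23 sec. 582582 check
  • Using sscanf, once per line. 27.20 sec. 582582 check
  • Using fscanf in large batches. 8.93 sec. 582582 check
  • Using textscan in large batches. 8.79 sec. 582582 check
  • Reading large batches into memory, then sscanf. 8.15 sec. 582582 check
  • Using java single line file reader and sscanf on single lines. 63.56 sec. 582582 check
  • Using java single item token scanner. 81.19 sec. 582582 check
  • Fully batched operations (non-compliant). 1.02 sec. 508680 check (violates rule 3)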

Summary

More than half of the original time (68 -> 27 sec) was consumed by inefficiencies in the str2num call, which can be removed by switching to sscanf.
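As a small illustration of that first point (a sketch, not part of the benchmark), the two conversions return the same numbers for a line in the assumed "int, int" format. str2num evaluates the text through eval, while sscanf parses it against a fixed format, which is a plausible source of the speed gap. The sample string and variable names are made up for the example.

```matlab
% Hypothetical sample line in the same "int, int" format as the test file
tline = '123456, 789';

a = str2num(tline);            % evaluates the text; returns a 1x2 row vector
b = sscanf(tline, '%d, %d');   % parses against a fixed format; returns a 2x1 column

% Same values, different orientation
sameValues = isequal(a(:), b);
fprintf(1, 'Same values: %d \n', sameValues);
```

Note the orientation difference: code that indexes nums(1) and nums(2) works with either result, but code that relies on the shape of nums may need a transpose.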

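About another 2/3 of the remaining time (27 -> 8 sec) can be cut by using larger batches for both the file reading and the string-to-number conversion.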
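One possible way to get each batch directly in a two-row shape, assuming the same "int, int" line format, is to pass a [2 N] size argument to fscanf instead of reshaping afterwards. The file name and the batchLines value below are made up for this demo; it is a sketch rather than a tuned implementation.

```matlab
% Create a tiny demo file with 25 lines of the form "k, k+100"
fName = fullfile(tempdir, 'fscanf_batch_demo.txt');   % hypothetical file name
fid = fopen(fName, 'w');
fprintf(fid, '%d, %d\n', [1:25; 101:125]);
fclose(fid);

batchLines = 10;     % lines per batch (use something like 1e4 for real files)
nBatches = 0;
nLines = 0;

fid = fopen(fName, 'r');
% A [2 batchLines] size fills a 2-row matrix column by column;
% transposing gives one row per line of the file.
batch = fscanf(fid, '%d, %d', [2 batchLines]).';
while ~isempty(batch)
    nBatches = nBatches + 1;
    for ix = 1:size(batch, 1)
        nums = batch(ix, :);       % per-line work goes here
        nLines = nLines + 1;
    end
    batch = fscanf(fid, '%d, %d', [2 batchLines]).';
end
fclose(fid);
fprintf(1, '%d batches, %d lines \n', nBatches, nLines);
```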

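If we are willing to violate rule 3 from the original post, another 7/8 of the time can be cut by switching to fully numeric (vectorized) processing. However, some algorithms do not lend themselves to this, so we leave it alone. (Note that the "check" value for the last entry does not match the others.)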
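The mismatch in the check value is expected: the running check used in the test script halves and rounds after every update, so feeding it one mean per batch is a different computation from feeding it one mean per line. A minimal sketch with made-up data shows the effect.

```matlab
% Made-up data: three "lines" of two numbers each
rows = [10 20; 30 40; 50 60];

% Per-line update, as in the compliant variants
perLine = 0;
for ix = 1:size(rows, 1)
    perLine = round((perLine + mean(rows(ix, :))) / 2);
end

% One update for the whole batch, as in the non-compliant variant
batched = round((0 + mean(rows(:))) / 2);

fprintf(1, 'per-line check %d, batched check %d \n', perLine, batched);
```

With this data the per-line check ends at 39 and the batched check at 18, so the two cannot be compared directly.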

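Finally, in direct contradiction to an earlier edit of mine in this answer, no savings are available from switching to the cached Java single-line readers. In fact, that solution is 2-3 times slower than the comparable single-line result using native readers (63 vs. 27 seconds).

Sample code for all of the solutions described above is included below.


Example code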

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %% Create a test file
    cd(tempdir);
    fName = 'demo_file.txt';
    fid = fopen(fName,'w');
    for ixLoop = 1:5
        d = randi(1e6, 1e5,2);
        fprintf(fid, '%d, %d \n',d);
    end
    fclose(fid);
    
    
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %% Initial code
    CHECK = 0;
    tic;
    fid = fopen('demo_file.txt');
    tline = fgetl(fid);
    while ischar(tline)
        nums = str2num(tline);
        CHECK = round((CHECK + mean(nums) ) /2);
        tline = fgetl(fid);
    end
    fclose(fid);
    t = toc;
    fprintf(1,'Initial code.  %3.2f sec.  %d check \n', t, CHECK);
    
    
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %% Using sscanf, once per line
    CHECK = 0;
    tic;
    fid = fopen('demo_file.txt');
    tline = fgetl(fid);
    while ischar(tline)
        nums = sscanf(tline,'%d, %d');
        CHECK = round((CHECK + mean(nums) ) /2);
        tline = fgetl(fid);
    end
    fclose(fid);
    t = toc;
    fprintf(1,'Using sscanf, once per line.  %3.2f sec.  %d check \n', t, CHECK);
    
    
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %% Using fscanf in large batches
    CHECK = 0;
    tic;
    bufferSize = 1e4;
    fid = fopen('demo_file.txt');
    scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
    while ~isempty(scannedData)
        for ix = 1:size(scannedData,1)
            nums = scannedData(ix,:);
            CHECK = round((CHECK + mean(nums) ) /2);
        end
        scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
    end
    fclose(fid);
    t = toc;
    fprintf(1,'Using fscanf in large batches.  %3.2f sec.  %d check \n', t, CHECK);
    
    
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %% Using textscan in large batches
    CHECK = 0;
    tic;
    bufferSize = 1e4;
    fid = fopen('demo_file.txt');
    scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
    while ~isempty(scannedData{1})
        for ix = 1:size(scannedData{1},1)
            nums = [scannedData{1}(ix) scannedData{2}(ix)];
            CHECK = round((CHECK + mean(nums) ) /2);
        end
        scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
    end
    fclose(fid);
    t = toc;
    fprintf(1,'Using textscan in large batches.  %3.2f sec.  %d check \n', t, CHECK);
    
    
    
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %% Reading in large batches into memory, incrementing to end-of-line, sscanf
    CHECK = 0;
    tic;
    fid = fopen('demo_file.txt');
    bufferSize = 1e4;
    eol = sprintf('\n');
    
    dataBatch = fread(fid,bufferSize,'uint8=>char')';
    dataIncrement = fread(fid,1,'uint8=>char');
    while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
        dataIncrement(end+1) = fread(fid,1,'uint8=>char');  %This can be slightly optimized
    end
    data = [dataBatch dataIncrement];
    
    while ~isempty(data)
        scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
        for ix = 1:size(scannedData,1)
            nums = scannedData(ix,:);
            CHECK = round((CHECK + mean(nums) ) /2);
        end
    
        dataBatch = fread(fid,bufferSize,'uint8=>char')';
        dataIncrement = fread(fid,1,'uint8=>char');
        while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
            dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
        end
        data = [dataBatch dataIncrement];
    end
    fclose(fid);
    t = toc;
    fprintf(1,'Reading large batches into memory, then sscanf.  %3.2f sec.  %d check \n', t, CHECK);
    
    
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %% Using Java single line readers + sscanf
    CHECK = 0;
    tic;
    bufferSize = 1e4;
    reader =  java.io.LineNumberReader(java.io.FileReader('demo_file.txt'),bufferSize );
    tline = char(reader.readLine());
    while ~isempty(tline)
        nums = sscanf(tline,'%d, %d');
        CHECK = round((CHECK + mean(nums) ) /2);
        tline = char(reader.readLine());
    end
    reader.close();
    t = toc;
    fprintf(1,'Using java single line file reader and sscanf on single lines.  %3.2f sec.  %d check \n', t, CHECK);
    
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %% Using Java scanner for file reading and string conversion
    CHECK = 0;
    tic;
    jFile = java.io.File('demo_file.txt');
    scanner = java.util.Scanner(jFile);
    scanner.useDelimiter('[\s\,\n\r]+');
    while scanner.hasNextInt()
        nums = [scanner.nextInt() scanner.nextInt()];
        CHECK = round((CHECK + mean(nums) ) /2);
    end
    scanner.close();
    t = toc;
    fprintf(1,'Using java single item token scanner.  %3.2f sec.  %d check \n', t, CHECK);
    
    
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %% Reading in large batches into memory, vectorized operations (non-compliant solution)
    CHECK = 0;
    tic;
    fid = fopen('demo_file.txt');
    bufferSize = 1e4;
    eol = sprintf('\n');
    
    dataBatch = fread(fid,bufferSize,'uint8=>char')';
    dataIncrement = fread(fid,1,'uint8=>char');
    while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
        dataIncrement(end+1) = fread(fid,1,'uint8=>char');  %This can be slightly optimized
    end
    data = [dataBatch dataIncrement];
    
    while ~isempty(data)
        scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
        CHECK = round((CHECK + mean(scannedData(:)) ) /2);
    
        dataBatch = fread(fid,bufferSize,'uint8=>char')';
        dataIncrement = fread(fid,1,'uint8=>char');
        while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
            dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
        end
        data = [dataBatch dataIncrement];
    end
    fclose(fid);
    t = toc;
    fprintf(1,'Fully batched operations.  %3.2f sec.  %d check \n', t, CHECK);
    

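(Original answer)

To expand on the point made by Ben: if you read these files line by line, your bottleneck will always be file I/O.

I understand that sometimes you cannot fit a whole file into memory. I typically read in a large batch of characters (1e5, 1e6 or thereabouts, depending on the memory of your system). Then I either read additional single characters (or back off single characters) to get a whole number of lines, and then run the string parsing (e.g. sscanf).

Then, if you want, you can process the resulting large matrix one row at a time, before repeating the process until you reach the end of the file.

It's a little bit tedious, but not that hard. I typically see a 90% plus improvement in speed over single-line readers.


(Terrible idea using Java batched line readers removed in shame)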
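The chunk-then-complete-the-line idea can be sketched as follows. This variant is an assumption on my part rather than the code used in the benchmark: instead of reading single characters up to the end of the line, it lets fgetl finish the partial line. It assumes "int, int" lines terminated by a newline; the file name and the tiny bufferSize are made up so that chunks visibly cut through lines.

```matlab
% Create a demo file with 1000 lines of the form "k, k+1000"
fName = fullfile(tempdir, 'chunk_demo.txt');    % hypothetical file name
fid = fopen(fName, 'w');
fprintf(fid, '%d, %d\n', [1:1000; 1001:2000]);
fclose(fid);

bufferSize = 256;    % deliberately tiny; 1e5 or 1e6 is more realistic
nLines = 0;
total = 0;

fid = fopen(fName, 'r');
while true
    chunk = fread(fid, bufferSize, 'uint8=>char')';   % row vector of characters
    if isempty(chunk)
        break;                                        % nothing left to read
    end
    rest = fgetl(fid);            % remainder of the current line, or -1 at end of file
    if ischar(rest)
        chunk = [chunk rest];     % chunk now ends on a line boundary
    end
    pairs = reshape(sscanf(chunk, '%d, %d'), 2, [])';  % one row per line
    for ix = 1:size(pairs, 1)
        nums = pairs(ix, :);      % per-line work goes here
        total = total + sum(nums);
        nLines = nLines + 1;
    end
end
fclose(fid);
fprintf(1, '%d lines, total %d \n', nLines, total);
```

If a chunk happens to end exactly on a newline, fgetl simply pulls in the next full line, so the batch still holds whole lines.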

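Answer 1: (score: 3)

Even if you cannot fit the whole file in memory, you should read it in large batches using the matrix read functions.

You may even be able to use vector operations for some of the data processing, which would speed things up further.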

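Answer 2: (score: 3)

I have had good results (speed-wise) using memmapfile(). This minimises the amount of in-memory data copying and makes use of the kernel's IO buffering. You need enough free address space (though not actual free memory) to map the entire file, and enough free memory to hold the output variable (obviously!).

The example code below reads a text file into a two-column matrix, data, of type int32.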

fname = 'file.txt';
fstats = dir(fname);
% Map the file as one long character string
m = memmapfile(fname, 'Format', {'uint8' [ 1 fstats.bytes] 'asUint8'});
textdata = char(m.Data(1).asUint8);
% Use textscan() to parse the string and convert to an int32 matrix
data = textscan(textdata, '%d %d', 'CollectOutput', 1);
data = data{:};
% Tidy up!
clear('m')

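You may need to tweak the parameters to textscan() to get exactly what you want; see the online documentation.

Answer 3: (score: 1)

I have found that MATLAB reads csv files significantly faster than text files, so if you can convert your text file to csv using some other software, it may significantly speed up Matlab's operations.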