Efficiently importing semi-structured text

Date: 2018-06-08 04:21:29

Tags: matlab parsing import

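I have multiple text files saved from a Tekscan pressure mapping system. I am trying to find the most efficient way to import the multiple comma-delimited matrices into a single 3-D matrix of type uint8. I have developed a solution that makes repeated calls to the MATLAB function dlmread. Unfortunately, importing the data takes roughly 1.5 minutes. I have included the code below.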

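This code calls a few other functions that I wrote (metaextract, linecount and framecount). I have not included them because they are not really relevant to the question at hand.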

Here are two links to samples of the files I am working with.

The first is a shorter file with 90 samples

The second is a longer file with 3458 samples

Any help would be greatly appreciated.

function pressureData = tekscanimport
% Import TekScan data from .asf file to 3d matrix of type uint8.

[id,path] = uigetfile('*.asf'); %User input for .asf file
if path == 0 %uigetfile returns zero on cancel
    error('You must select a file to continue')
end

path = strcat(path,id); %Concatenate path and id to full path

% function calls
pressureData.metaData = metaextract(path);
nLines = linecount(path); %Find number of lines in file
nFrames = framecount(path,nLines);%Find number of frames

rowStart = 25; %Default starting row to read from tekscan .asf file
rowEnd = rowStart + 41; %Frames are 42 rows long
colStart = 0;%Default starting col to read from tekscan .asf file
colEnd = 47;%Frames are 48 columns wide
pressureData.frames = zeros([42,48,nFrames],'uint8');%Preallocate for speed

f = waitbar(0,'1','Name','loading Data...',...
    'CreateCancelBtn','setappdata(gcbf,''canceling'',1)');
setappdata(f,'canceling',0);

for i = 1:nFrames %Loop through file skipping frame metadata
    if getappdata(f,'canceling')
        break
    end
    waitbar(i/nFrames,f,sprintf('Loaded %.2f%%', i/nFrames*100));

    %Make repeated calls to dlmread
    pressureData.frames(:,:,i) = dlmread(path,',',[rowStart,colStart,rowEnd,colEnd]);
    rowStart = rowStart + 44;
    rowEnd = rowStart + 41;
end
delete(f)
end

1 Answer:

Answer 0 (score: 1)

I gave it a try. This code opens your large file in 3.6 seconds on my PC. The trick is to use sscanf instead of the str2double and str2num functions.

clear all;tic
fid = fopen('tekscanlarge.txt','rt');
%read the header, stop at frame
header='';
l = fgetl(fid);
while length(l)>5&&~strcmp(l(1:5),'Frame')
    header=[header,l,sprintf('\n')];
    l = fgetl(fid);
    if length(l)<5,l(end+1:5)=' ';end
end
%all data at once
dat = fread(fid,inf,'*char');
fclose(fid);
%allocate space
res = zeros([48,42,3458],'uint8');
%get all line endings
LE = [0,regexp(dat','\n')];
i=1;
for ct = 2:length(LE)-1 %go line by line
    L = dat(LE(ct-1)+1:LE(ct)-1);
    if isempty(L),continue;end
    if all(L(1:5)==['Frame']')
        fr = sscanf(L(7:end),'%u');
        i=1;
        continue;
    end
    % sscanf is given a row char vector with space separation here,
    % so transpose L and replace the commas with spaces.
    res(:,i,fr) = uint8(sscanf(strrep(L',',',' '),'%u'));
    i=i+1;
end
toc
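
For comparison, the per-row conversion that sscanf does in the loop above can be sketched in plain C with strtoul. This is only an illustration under my own assumptions, not part of the answer's code: parse_row is a made-up name, and it assumes each row holds non-negative integers separated by commas.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: parse one comma-separated row of non-negative
   integers into out (at most cap values) and return how many were read.
   Values above 255 are truncated to their low byte. */
size_t parse_row(const char *line, unsigned char *out, size_t cap)
{
    size_t count = 0;
    const char *p = line;

    while (count < cap) {
        char *end;
        unsigned long v = strtoul(p, &end, 10);
        if (end == p) {
            break;              /* no digits found here: stop */
        }
        out[count++] = (unsigned char)v;
        p = end;
        if (*p == ',') {
            p++;                /* step over the separator */
        }
    }
    return count;
}
```

A row such as "12,0,255" gives three values, while a line like "Frame 5" gives none because it does not start with a number, so a caller would still need to handle frame headers separately, as the MATLAB loop does.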

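Does anyone know of a conversion that is faster than sscanf? The code spends most of its time in that function (2.17 seconds). For a 13.1 MB data set, I find that very slow compared with the speed of memory.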

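I found a way to do this in 0.2 seconds, which may be useful to others as well. This mex file scans a list of char values for numbers and reports them back. Save it as mexscan.c and run mex mexscan.c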

#include "mex.h" 
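/* The computational routine: collect each run of digits in the text as one
   uint8 value in out (out must be zero-initialised). */
void calc(unsigned char *in, unsigned char *out, long Sout, long Sin)
{
    long ct = 0;       /* index of the number currently being built */
    int newnumber = 0; /* 1 while inside a run of digits */
    /* MATLAB chars are 16 bits wide: step 2 bytes and read the low byte
       (this assumes a little-endian machine) */
    for (long i = 0; i < Sin; i += 2) {
        if (in[i] >= '0' && in[i] <= '9') { /* it is a digit */
            out[ct] = out[ct]*10 + in[i] - '0';
            newnumber = 1;
        } else { /* it is not a digit */
            if (newnumber == 1) {
                ct++;
                if (ct >= Sout) { return; } /* output is full */
            }
            newnumber = 0;
        }
    }
}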

/* The gateway function */
void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    unsigned char *in;             /* input vector */
    long Sout;                     /* input size of output vector */
    long Sin;                      /* size of input vector */
    unsigned char *out;            /* output vector*/

    /* check for proper number of arguments */
    if(nrhs!=2) {
        mexErrMsgIdAndTxt("MyToolbox:mexscan:nrhs","Two inputs required.");
    }
    if(nlhs!=1) {
        mexErrMsgIdAndTxt("MyToolbox:mexscan:nlhs","One output required.");
    }
    /* make sure the first input argument is type char */
    if(!mxIsClass(prhs[0], "char"))  {
        mexErrMsgIdAndTxt("MyToolbox:mexscan:notChar","Input matrix must be type char.");
    }
    /* make sure the second input argument is type uint32 */
    if(!mxIsClass(prhs[1], "uint32"))  {
        mexErrMsgIdAndTxt("MyToolbox:mexscan:notUint32","Output size must be type uint32.");
    }

    /* number of bytes in the input: a column of 16-bit chars, 2 bytes each */
    Sin = mxGetM(prhs[0])*2;
    /* create a pointer to the raw data in the input matrix */
    in = (unsigned char *) mxGetData(prhs[0]);
    Sout = (long) mxGetScalar(prhs[1]);

    /* create the output matrix (zero-initialised) */
    plhs[0] = mxCreateNumericMatrix(1,Sout,mxUINT8_CLASS,mxREAL);

    /* get a pointer to the raw data in the output matrix */
    out = (unsigned char *) mxGetData(plhs[0]);

    /* call the computational routine */
    calc(in,out,Sout,Sin);
}
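
To try the scanning idea outside MATLAB, here is a minimal standalone sketch in plain C. It is not the MEX file above: the name scan_digits and its signature are made up for illustration, it reads ordinary 1-byte characters rather than MATLAB's 16-bit chars, and unlike the MEX routine it returns the number of values found.

```c
#include <stddef.h>

/* Hypothetical helper (not part of mexscan.c): collect every run of decimal
   digits in text[0..n) into out, one byte per number, and return how many
   numbers were stored. At most cap numbers are written. Values above 255
   are truncated to their low byte. */
size_t scan_digits(const char *text, size_t n, unsigned char *out, size_t cap)
{
    size_t count = 0;   /* numbers completed so far */
    int in_number = 0;  /* 1 while inside a run of digits */
    unsigned value = 0; /* value of the run being read */

    for (size_t i = 0; i < n; i++) {
        char c = text[i];
        if (c >= '0' && c <= '9') {
            if (count >= cap) {
                return count;   /* no room for another number */
            }
            unsigned digit = (unsigned)(c - '0');
            value = in_number ? value * 10 + digit : digit;
            in_number = 1;
        } else if (in_number) {
            out[count++] = (unsigned char)value;  /* run ended: store it */
            in_number = 0;
        }
    }
    if (in_number) {
        out[count++] = (unsigned char)value;      /* number at end of buffer */
    }
    return count;
}
```

For example, scanning the text "Frame 1\n0,12,255\n" yields four values (1, 0, 12, 255), so the frame number ends up in the output along with the samples, just as with the MEX version.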

This script now runs in 0.2 seconds and returns the same result as the previous script.

clear all;tic
fid = fopen('tekscanlarge.txt','rt');
%read the header, stop at frame
header='';
l = fgetl(fid);
while length(l)>5&&~strcmp(l(1:5),'Frame')
    header=[header,l,sprintf('\n')];
    l = fgetl(fid);
    if length(l)<5,l(end+1:5)=' ';end
end
%all data at once
dat = fread(fid,inf,'*char');
fclose(fid);
S=[48,42,3458];
d = mexscan(dat,uint32(prod(S)+3458));
d(1:prod(S(1:2))+1:end)=[];%remove frame numbers
d = reshape(d,S);
toc
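
The indexing expression d(1:prod(S(1:2))+1:end)=[] deletes the first element of every block of 48*42+1 values, i.e. the frame number that precedes each frame. Frame numbers above 255 do not fit in a uint8 anyway, which should be harmless here since they are thrown away. The same step can be sketched in C, assuming the layout produced by the scan (one frame number followed by that frame's samples); strip_frame_numbers is a hypothetical name, not something from the script.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: d holds n_frames records, each made of one leading
   frame number followed by frame_len samples. Move the samples to the front
   of d in place and return how many samples remain. */
size_t strip_frame_numbers(unsigned char *d, size_t n_frames, size_t frame_len)
{
    for (size_t f = 0; f < n_frames; f++) {
        /* record f starts at f*(frame_len+1); skip its first byte.
           memmove is used because source and destination can overlap. */
        memmove(d + f * frame_len, d + f * (frame_len + 1) + 1, frame_len);
    }
    return n_frames * frame_len;
}
```

With the sizes from the script this would be called with n_frames = 3458 and frame_len = 48*42, leaving 3458*48*42 samples ready to be viewed as a 48-by-42-by-3458 array.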