Question

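I have a large data file whose text is formatted as a single column with n rows. Each row is either a real number or a string with the value No Data. I have imported this text as an nx1 cell named Data. Now I want to filter the data and create an nx1 array from it, with NaN values in place of No Data. I have managed to do it using a simple loop (see below); the problem is that it is quite slow.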

No data
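The original loop is not shown here. Purely as an illustration, a loop of the kind described might look like the following minimal sketch; the sample values, the output name z, and the string comparison are assumptions, not the original code:

```matlab
% Hypothetical stand-in for the imported nx1 cell (assumed sample values)
Data = {'9.343410'; 'No Data'; '1.535000'};

n = numel(Data);
z = zeros(n, 1);                      % preallocate the nx1 output array
for i = 1:n
    if strcmp(Data{i}, 'No Data')
        z(i) = NaN;                   % missing value
    else
        z(i) = str2double(Data{i});   % convert the text to a double
    end
end
```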

Is there a way to optimize it?

Answer 1

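Actually, the whole parsing can be performed in a single line using a properly parametrized readtable function call (no iterations, no sanitization, no conversion, etc.):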

data = readtable('data.txt','Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No data');

Here is the content of the text file I used as a template for my test:

9.343410
11.54300
6.733000
-135.210
No data
34.23000
0.550001
No data
1.535000
-0.00012
7.244000
9.999999
34.00000
No data

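The output (which can be retrieved as a vector of doubles using data.Var1) contains NaN in place of each No data line. The parameters work as follows: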

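Delimiter: since you are dealing with a single column, it is specified as the newline character. This prevents No data from producing two columns because of the whitespace.
Format: you want numerical values.
TreatAsEmpty: this tells the function to treat a specific string as empty, and empty doubles are set to NaN by default.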
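As a minimal sketch of how these parameters fit together, the snippet below writes a tiny test file, reads it with the same readtable call, and extracts the column as a double vector. The file name sample_nodata.txt, the sample values, and the variable names values and isMissing are assumptions made for illustration:

```matlab
% Write a small test file (assumed name and values, for illustration only)
filename = 'sample_nodata.txt';
fid = fopen(filename, 'wt');
fprintf(fid, '%s\n', '9.343410', 'No data', '1.535000', 'No data');
fclose(fid);

% Same call as above: one numeric column, 'No data' treated as empty
data = readtable(filename, 'Delimiter','\n', 'Format','%f', ...
    'ReadVariableNames',false, 'TreatAsEmpty','No data');

values = data.Var1;             % column vector of doubles
isMissing = isnan(values);      % logical mask of the 'No data' rows
fprintf('%d of %d rows are missing\n', nnz(isMissing), numel(values));

delete(filename);               % remove the scratch file
```

The NaN mask can then be used to drop or replace the missing rows, for example values(~isMissing).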

Answer 2

If you run this, you can find out which approach is faster. It creates an 11 MB text file and reads it using the various methods.

filename = 'data.txt';
%% generate data
fid = fopen(filename,'wt');
N = 1E6;
for ct = 1:N
    val = rand(1);
    if val<0.01
        fwrite(fid,sprintf('%s\n','No Data'));
    else
        fwrite(fid,sprintf('%f\n',val*1000));
    end
end
fclose(fid);

%% Tommaso Belluzzo
tic
data = readtable(filename,'Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No Data');
toc

%% Camilo Rada
tic
[txtMat, nLines]=txt2mat(filename);
NoData=txtMat(:,1)=='N';
z = zeros(nLines,1);
z(NoData)=nan;
toc

%% Gelliant
tic
fid = fopen(filename,'rt');
z= textscan(fid, '%f', 'Delimiter','\n', 'whitespace',' ', 'TreatAsEmpty','No Data', 'EndOfLine','\n','TextType','char'); 
z=z{1};
fclose(fid);
toc

Results (in the order readtable, txt2mat, textscan):

Elapsed time is 0.273248 seconds.
Elapsed time is 0.304987 seconds.
Elapsed time is 0.206315 seconds.

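txt2mat is slow: even without converting the resulting char matrix to numbers, it is outperformed by readtable and textscan. textscan is slightly faster than readtable, probably because it skips some internal sanity checks and does not convert the resulting data to a table.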
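A single tic/toc pair measures one run only, so timings like these can vary between runs. A minimal sketch of a more repeatable measurement repeats the textscan read several times and reports the median. The file name bench_nodata.txt and the values of N and nRuns are assumptions chosen to keep the example quick:

```matlab
filename = 'bench_nodata.txt';   % assumed scratch file name
N = 1e4;                         % assumed size, much smaller than the 1E6 used above
nRuns = 5;                       % assumed number of repetitions

% Generate the test file: roughly 1% of the lines are 'No Data'
fid = fopen(filename, 'wt');
for ct = 1:N
    val = rand(1);
    if val < 0.01
        fprintf(fid, '%s\n', 'No Data');
    else
        fprintf(fid, '%f\n', val*1000);
    end
end
fclose(fid);

% Time the textscan read several times
t = zeros(nRuns, 1);
for k = 1:nRuns
    tic
    fid = fopen(filename, 'rt');
    z = textscan(fid, '%f', 'Delimiter','\n', 'whitespace',' ', ...
        'TreatAsEmpty','No Data', 'EndOfLine','\n', 'TextType','char');
    z = z{1};
    fclose(fid);
    t(k) = toc;
end
fprintf('textscan: median %.4f s over %d runs\n', median(t), nRuns);

delete(filename);                % remove the scratch file
```

The same loop can wrap the readtable or txt2mat calls to compare the methods on equal terms.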

Answer 3

Depending on how big your files are and how often you read such files, you might want to go beyond readtable, which can be quite slow.

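EDIT: After testing, with a file this simple the method below provides no advantage. The method was developed to read RINEX files, which are large and complex in the sense that they are alphanumeric, with different numbers of columns and different delimiters in different rows.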

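The most efficient way I've found is to read the whole file as a char matrix; you can then easily find the "No Data" lines. And if your real numbers are formatted with a fixed width, you can convert them from char to numbers much more efficiently than with str2double or similar functions.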

The function I wrote to read a text file into a char matrix is:

function [txtMat, nLines]=txt2mat(filename)
% txt2mat Read the content of a text file to a char matrix
%   Read all the content of a text file to a matrix as wide as the longest
%   line on the file. Shorter lines are padded with blank spaces. New lines
%   are not included in the output.
%   New lines are identified by new line \n characters.

    % Reading the whole file in a string
    fid=fopen(filename,'r');
    fileData = char(fread(fid));
    fclose(fid);
    % Finding new lines positions
    newLines= fileData==sprintf('\n');
    linesEndPos=find(newLines)-1;

    % Calculating number of lines
    nLines=length(linesEndPos);
    % Calculating the width (number of characters) of each line
    linesWidth=diff([-1; linesEndPos])-1;
    % Number of characters per row including new lines
    charsPerRow=max(linesWidth)+1;

    % Initializing output var with blank spaces
    txtMat=char(zeros(charsPerRow,nLines,'uint8')+' ');

    % Computing a logical index to all characters of the input string to
    % their final positions
    charIdx=false(charsPerRow,nLines);
    % Indexes of all new lines
    linearInd = sub2ind(size(txtMat), (linesWidth+1)', 1:nLines);
    charIdx(linearInd)=true;
    charIdx=cumsum(charIdx)==0;

    % Filling output matrix
    txtMat(charIdx)=fileData(~newLines);
    % Cropping the last row corresponding to new line characters and transposing
    txtMat=txtMat(1:end-1,:)';
end

Then, once you have all your data in a matrix (let's assume it is named txtMat), you can do:

NoData=txtMat(:,1)=='N';

If your number fields have a fixed width, you can convert them to numbers much more efficiently than with str2num:

values=((txtMat(:,1:10)-'0')*[1e6; 1e5; 1e4; 1e3; 1e2; 10; 1; 0; 1e-1; 1e-2]);

Here I've assumed the numbers have 7 integer digits and 2 decimal places, but you can easily adapt it to your case.

To finish, you need to set the NaN values:

values(NoData)=NaN;

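This is more cumbersome than readtable or similar functions, but if you are looking to optimize the reading, it is much faster. If you don't have fixed-width numbers, you can still do it this way by adding a couple of lines to count the number of digits and find the position of the decimal point before doing the conversion, but that will slow things down a bit. However, I think it will still be faster.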
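A small worked example may make the fixed-width conversion easier to follow. This is only a sketch under stated assumptions: the sample rows are made up, and every numeric row is assumed to be zero-padded and non-negative in the layout DDDDDDD.dd (7 integer digits, a decimal point, 2 decimals):

```matlab
% Assumed sample char matrix: all rows are exactly 10 characters wide
txtMat = ['0001234.56'; ...
          '0000007.25'; ...
          'No Data   '];

% Rows starting with 'N' are the missing ones
NoData = txtMat(:,1) == 'N';

% Place value of each column; the 0 cancels the decimal point character
weights = [1e6; 1e5; 1e4; 1e3; 1e2; 10; 1; 0; 1e-1; 1e-2];

% Subtracting '0' turns digit characters into 0..9, then one matrix
% product combines the digits of every row at once
values = (txtMat(:,1:10) - '0') * weights;

% The text rows produce meaningless numbers, so overwrite them
values(NoData) = NaN;
```

With this layout, a leading space or minus sign would be treated as a digit and give a wrong result, so space-padded or negative numbers need extra handling before the product.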

Optimizing reading data in Matlab
