Question

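I am trying to parse a relatively large text file that contains columns of numbers interspersed with other text, but I only want the columns of numbers. Some of the other text, not shown here, does not appear at such regular intervals.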

File format:

*** LOTS OF OTHER TEXT AND NUMBERS ***

  iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph     time/iter
   111  3.4714e-08  5.3037e-10  6.0478e-10  1.6219e-15  1.8439e-13  0.0000e+00  0:00:01   14
   112  3.2652e-08  5.0553e-10  5.6497e-10  1.3961e-15  1.5730e-13  0.0000e+00  0:00:01   13
   113  3.1371e-08  4.6175e-10  5.0506e-10  1.2020e-15  1.3419e-13  0.0000e+00  0:00:01   12
   114  3.0016e-08  4.4331e-10  4.7391e-10  1.0388e-15  1.1447e-13  0.0000e+00  0:00:01   11
   115  2.8702e-08  4.2111e-10  4.4778e-10  8.9904e-16  9.7680e-14  0.0000e+00  0:00:01   10
   116  2.7476e-08  4.1484e-10  4.2711e-10  7.7955e-16  8.3342e-14  0.0000e+00  0:00:01    9
   117  2.6436e-08  3.9556e-10  4.0601e-10  6.7890e-16  7.1113e-14  0.0000e+00  0:00:01    8
   118  2.5374e-08  3.8633e-10  3.8826e-10  5.9234e-16  6.0674e-14  0.0000e+00  0:00:00    7
   119  2.4292e-08  3.7473e-10  3.7584e-10  5.1814e-16  5.1786e-14  0.0000e+00  0:00:00    6
   120  2.3474e-08  3.5952e-10  3.5622e-10  4.5405e-16  4.4207e-14  0.0000e+00  0:00:00    5
   121  2.2612e-08  3.4485e-10  3.4159e-10  3.9910e-16  3.7707e-14  0.0000e+00  0:00:00    4
  iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph     time/iter
   122  2.1992e-08  3.4100e-10  3.2964e-10  3.5272e-16  3.2204e-14  0.0000e+00  0:00:00    3
   123  2.1592e-08  3.2444e-10  3.0170e-10  3.1487e-16  2.7500e-14  0.0000e+00  0:00:00    2
   124  2.1053e-08  3.3145e-10  2.9325e-10  2.8009e-16  2.3485e-14  0.0000e+00  0:00:00    1
   125  2.0390e-08  3.1502e-10  2.7534e-10  2.5433e-16  2.0053e-14  0.0000e+00  0:00:00    0
  step  flow-time mfr_arm_inne mfr_arm_oute pressure_sta pressure_sta pressure_tot pressure_tot velocity_max velocity_min
     1  5.0000e-07 -5.5662e-08  1.4217e-07  6.0015e+00  5.9998e+00  6.0015e+00  5.9998e+00  2.8934e-04  3.3491e-10
Flow time = 5e-07s, time step = 1
799 more time steps

Updating solution at time levels N and N-1.
 done.


Writing data to output file.
Current time=0.000000  Position=-0.00000036409265555078  Velocity=0.000015  Net force=0.210322
Fluid force=-0.477050N, Stator force=0.200000N ,Spring force=-32.990534N ,Top force=0.000000N, Bottom force=33.007906N, External force=0.470000N

Next time=0.000001  Position=-0.00000036400170391852  Velocity=0.000182
Applying motion to dynamic zone.

*** CONTINUING TEXT AND NUMBERS ***

What I want is:

111  3.4714e-08  5.3037e-10  6.0478e-10  1.6219e-15  1.8439e-13  0.0000e+00  0:00:01   14
112  3.2652e-08  5.0553e-10  5.6497e-10  1.3961e-15  1.5730e-13  0.0000e+00  0:00:01   13

My script works so far, but it takes about 80 seconds to get through the whole process.

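I think the colons in the time column make things more awkward. Some of my files have more or fewer columns containing different types of data, and some have extra sets appended after the main blocks, for example: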

  step  flow-time mfr_arm_inne mfr_arm_oute pressure_sta pressure_sta pressure_tot pressure_tot velocity_max velocity_min
     1  5.0000e-07 -5.5662e-08  1.4217e-07  6.0015e+00  5.9998e+00  6.0015e+00  5.9998e+00  2.8934e-04  3.3491e-10

I do not want to capture this data, but its format can be very similar (sometimes identical) to the lines I do want.

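My script mostly works by reading each line and checking whether its first few characters (based on the length of the iteration number) match the value I expect next (counting up 1, 2, 3, ..., n). I do this to try to exclude the lines under "step ..." that I don't want. However, the file is about 180,000 lines long (and this is my shortest one), so you can imagine this gets a bit slow.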

% read the raw data from the file
file = 'file.txt';
fid = fopen(file, 'r');
raw = textscan(fid, '%s', 'Delimiter', '\n');
fid = fclose(fid);
raw = raw{1,1};

% expression used for splitting the columns up
colExpr = '[\d\.e:\-\+]+';

% beginning number
iterNum = 1;

% loop through lines
for line = 1:length(raw);

    % convert to string for comparison
    iterStr = num2str(iterNum);
    thisLine = raw{line, 1};

    % if the right length and the right string,
    if length(iterStr) <= length(thisLine) && ...
            strcmp(thisLine(1:length(iterStr)), iterStr)

        % split the string
        result(iterNum,:) = regexp(thisLine,colExpr, 'match');

        iterNum = iterNum + 1;

    end

end

% convert to matrix
residuals = cellfun(@str2num, result);

Using the profiler, I found that num2str() is the slowest part (20 s), followed by int2str() (10 s), although I can't see a way to read the data inside the loop without it.

Is there something I am missing that I could try in order to optimize this process?

Edit

I have included more of the lines I don't want, along with the possible alternative formats, to help with answers.

Answer 1

Since you have already loaded the entire contents into a cell array (raw), you can call regexp on it directly to remove the unwanted lines.

%// Find lines that contain your data
matches = regexp(raw, '^\s*\d(.*?\de[+\-]\d){6}');

%// Empty matches (header lines) should be removed
toremove = cellfun(@isempty, matches);
raw = raw(~toremove);

You can then use str2num combined with strjoin to convert the result into a numeric array.

data = reshape(str2num(strjoin(raw)), 7, []).';

The benefit of this approach is that it avoids any kind of loop or repeated function calls, which are notorious for slowing MATLAB down.
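As a small self-contained illustration of the filtering step, the sketch below applies the same pattern to a hand-made cell array. The sample lines are copied from the question; `raw` here is only a stand-in for the cell array read by textscan, and `keep` and `dataLines` are names introduced for this example.

```matlab
% Stand-in for the cell array produced by textscan (lines taken from the question)
raw = { ...
    'iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph     time/iter'; ...
    '111  3.4714e-08  5.3037e-10  6.0478e-10  1.6219e-15  1.8439e-13  0.0000e+00  0:00:01   14'; ...
    'Flow time = 5e-07s, time step = 1'; ...
    '112  3.2652e-08  5.0553e-10  5.6497e-10  1.3961e-15  1.5730e-13  0.0000e+00  0:00:01   13'};

% regexp on a cell array returns one result per cell;
% with 'once' the result is [] for lines that do not match
matches = regexp(raw, '^\s*\d(.*?\de[+\-]\d){6}', 'once');

% Logical mask of the lines that look like data rows
keep = ~cellfun(@isempty, matches);
dataLines = raw(keep);
```

Note that the "step" row shown in the question also starts with a digit and contains more than six exponent-format numbers, so it would pass this pattern as well and would need a separate check.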

Update

An alternative version of @Pursuit's answer would be:

numbers = cellfun(@(x)sscanf(x, '%f %f %f %f %f %f %f').', raw, 'uni', 0);
numbers = cat(1, numbers{:});
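One detail that may be worth checking against real files: sscanf reapplies its format until the input stops matching, so on lines that end with a time column such as 0:00:01 it can pick up an extra value beyond the seven wanted ones. A minimal sketch of capping the count with the optional size argument follows; the two sample lines are copied from the question, and the variable names are illustrative.

```matlab
% Sample lines copied from the question
headerLine = '  iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph     time/iter';
dataLine   = '   111  3.4714e-08  5.3037e-10  6.0478e-10  1.6219e-15  1.8439e-13  0.0000e+00  0:00:01   14';

% The third argument limits sscanf to at most 7 numeric values,
% so the leading digit of the time column is never read
fromHeader = sscanf(headerLine, '%f', 7);    % empty: the header does not start with a number
fromData   = sscanf(dataLine, '%f', 7).';    % 1-by-7 row vector

% Without the cap, parsing continues until the first non-numeric character
uncapped = sscanf(dataLine, '%f');
```

Comparing numel(uncapped) with numel(fromData) on a few representative lines is a quick way to see whether the cap matters for a given file.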

Answer 2

Here is a different approach: first process the file externally, for example:

# only keep lines starting with a digit
$ grep '^\s*[0-9]' file.txt > file2.txt

On Windows, you can use findstr as the equivalent of grep:

C:\> findstr /R /c:"^[ \t]*[0-9]" file.txt > file2.txt

Now, in MATLAB, the resulting numeric data can easily be loaded as a matrix:

>> load -ascii file2.txt
>> t = array2table(file2, 'VariableNames',...
    {'iter','continuity','xvelocity','yvelocity','k','epsilon','vf_vapour_ph'})
t = 
    iter    continuity    xvelocity     yvelocity        k          epsilon      vf_vapour_ph
    ____    __________    __________    _________    __________    __________    ____________
     1             0      6.2376e-07            0     0.0018988        2708.2    0           
     2             0         0.21656      0.23499     0.0097531       0.13395    0           
     3             0         0.11755      0.12824     0.0032109        0.1146    0           
     4             0        0.068112     0.072691    0.00089801      0.062219    0           
     5             0        0.043498     0.045244    0.00020248      0.025923    0           
     6        0.1938        0.029107     0.029029    4.8399e-05     0.0099171    0           
     7       0.13594        0.020037     0.019577    1.5502e-05     0.0043624    0           
     8      0.097518        0.013805     0.013249    5.1736e-06     0.0023341    0           
     9      0.070467       0.0098312    0.0091925    1.8272e-06     0.0012615    0           
    10      0.051538       0.0071181    0.0064673    7.2446e-07     0.0007012    0           
    11      0.038065       0.0052115    0.0046128    4.2786e-07    0.00040619    0           
    12      0.028369       0.0038465    0.0033381    2.8256e-07    0.00025864    0           
    13      0.021326        0.002857    0.0024454    1.9279e-07    0.00016126    0
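If an external tool is not available, roughly the same "starts with a digit" filter can be sketched inside MATLAB. This assumes the lines are already in a cell array, as in the question; the shortened sample lines and the variable names below are made up for illustration.

```matlab
% Shortened sample lines (three numeric fields each) standing in for the file contents
raw = { ...
    '  iter  continuity  x-velocity'; ...
    '   111  3.4714e-08  5.3037e-10'; ...
    'Writing data to output file.'; ...
    '   112  3.2652e-08  5.0553e-10'};

% Same idea as grep '^\s*[0-9]': keep lines whose first non-blank character is a digit
startsWithDigit = ~cellfun(@isempty, regexp(raw, '^\s*[0-9]', 'once'));
kept = raw(startsWithDigit);

% Parse each surviving line into a row and stack the rows into a matrix
rows = cellfun(@(s) sscanf(s, '%f').', kept, 'UniformOutput', false);
values = cat(1, rows{:});
```

The stacking step only works when every kept line yields the same number of values, so files with mixed row types would still need an extra check on the row length.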

Answer 3

I would try running sscanf on each line and keep only the lines that parse well.

For example, given:

raw{11} = '11  3.8065e-02  5.2115e-03  4.6128e-03  4.2786e-07  4.0619e-04  0.0000e+00'
raw{12} = 'iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph'

then

>> sscanf(raw{11},'%f')
ans =
                        11
                  0.038065
                 0.0052115
                 0.0046128
                4.2786e-07
                0.00040619
                         0

and

>> sscanf(raw{12},'%f')
ans =
     []

To complete the idea, your code would look like this:

%% Read the file
file = 'dataFile.txt';
fid = fopen(file, 'r');
raw = textscan(fid, '%s', 'Delimiter', '\n');
fid = fclose(fid);
raw = raw{1,1};

%% Parse the file into the "residuals" variable

nextLine = 1; %This is the index of next line to insert

%Go through each line, one at a time
for ix = 1:length(raw)    
    %Parse the line with sscanf
    numbers = sscanf(raw{ix},'%f');

    if ~isempty(numbers)  %Skip any row that did not parse, otherwise ...
        %If you know the number of columns, you could replace "~isempty()" with "length()== "

        if nextLine == 1
            %If this is the first line of numbers, then initialize the
            %"residuals" variable.
            residuals= zeros(length(raw), length(numbers));
        end

        %Store the data, and increment "nextLine"
        residuals(nextLine,:) = numbers;
        nextLine = nextLine + 1;
    end
end

%Now, trim the excess allocation from "residuals"
residuals = residuals(1:(nextLine-1),:);

(Please let me know how it compares in terms of speed.)
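The comment in the loop about checking the column count can be sketched as follows. This is a minimal illustration, assuming the wanted rows have exactly seven numeric fields; `nCols` is a name introduced here, and the second data row is made up in the same format as the first. Lines such as "799 more time steps" parse to a single number and are therefore skipped instead of causing a size mismatch.

```matlab
% Sample lines: a header, two data rows, and a stray line that starts with a number
raw = { ...
    'iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph'; ...
    '11  3.8065e-02  5.2115e-03  4.6128e-03  4.2786e-07  4.0619e-04  0.0000e+00'; ...
    '799 more time steps'; ...
    '12  2.8369e-02  3.8465e-03  3.3381e-03  2.8256e-07  2.5864e-04  0.0000e+00'};

nCols = 7;                              % assumed number of numeric fields in a wanted row
residuals = zeros(numel(raw), nCols);   % preallocate for the worst case
nextLine = 1;                           % index of the next row to fill

for ix = 1:numel(raw)
    numbers = sscanf(raw{ix}, '%f');
    if numel(numbers) == nCols          % skip headers and partially numeric lines
        residuals(nextLine, :) = numbers;
        nextLine = nextLine + 1;
    end
end

% Trim the unused preallocated rows
residuals = residuals(1:nextLine-1, :);
```

Wrapping the loop in tic and toc gives a quick timing to compare against the original script.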

Optimizing data parsing in MATLAB
