Question

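I have the following performance problem involving the input of a large text file (~500k lines) and the subsequent data parsing.

Consider a text file data.txt with the following example structure, where the two header lines can reappear anywhere in the file: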

Name Date Val1 val2
--- ------- ---- ----
BA 2013-09-07 123.123 1232.22
BA 2013-09-08 435.65756 2314.34
BA 2013-09-09 234.2342 21342.342

The code I have written is the following:

%# Read in file using textscan, read all values as string

inFile = fopen('data.txt','r');
DATA = textscan(inFile, '%s %s %s %s');
fclose(inFile);

%# Remove the header lines everywhere in DATA:
%# Search indices of the first entry in first cell, i.e. 'Name', and remove 
%# all lines corresponding to those indices

[iHeader,~] = find(strcmp(DATA{1},DATA{1}(1)));
for i=1:length(DATA)
    DATA{i}(iHeader)=[];
end

%# Repeat again, the first entry corresponds now to '---'

[iHeader,~] = find(strcmp(DATA{1},DATA{1}(1)));
for i=1:length(DATA)
    DATA{i}(iHeader)=[];
end

%# Now convert the cells for column Val1 and Val2 in data.txt to doubles
%# since they have been read in as strings:

for i=3:4
    [A] = cellfun(@str2double,DATA{i});
    DATA{i} = A;
end

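I chose to read everything in as strings in order to be able to remove the header lines everywhere in DATA.

Timing the code tells me that the slowest part is the conversion [A] = cellfun(@str2double,DATA{i}), even though str2double is already the faster option compared to str2num. The second slowest part is textscan.

The question now is: is there a faster way to handle this?

Please let me know if I need to clarify anything further. Forgive me if there is a very obvious solution that I have not seen; I have only been working with MATLAB for three weeks.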

Answer 1

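You can use the textscan option called CommentStyle, which skips parts of the file (in your case the 2 repeated header lines) and lets you read the file in a single function call.

As the doc says, CommentStyle can be used in two ways: as a single string, such as '%', to ignore the characters that follow the string on the same line, or as a cell array of two strings, such as {'/*', '*/'}, to ignore the characters between the two strings (including line endings). We will use the second option here: remove the characters between Name and the line of dashes. Since the closing string consists of repeated - characters, we need to specify the whole string.

inFile = fopen('data.txt','r');
DATA = textscan(inFile, '%s %s %f %f', ...
    'CommentStyle', {'Name';'--- ------- ---- ----'});
fclose(inFile);

You can use datenum to convert the date strings into meaningful numbers.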
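As a small illustration of the datenum conversion mentioned in this answer, here is a minimal sketch. It assumes the date column has already been read as a cell array of strings; the variable names dateStr, dateNum, dayGaps and roundTrip are made up for the example, and the three dates are copied from the sample file in the question.

```matlab
% Date strings as textscan would return them for a %s column
% (hypothetical variable name; values taken from the sample data).
dateStr = {'2013-09-07'; '2013-09-08'; '2013-09-09'};

% Passing the format explicitly spares datenum from guessing it.
dateNum = datenum(dateStr, 'yyyy-mm-dd');

% Serial date numbers count days, so consecutive days differ by 1.
dayGaps = diff(dateNum);

% Convert back to text to check the round trip (char matrix, one row per date).
roundTrip = datestr(dateNum, 'yyyy-mm-dd');
```

In newer MATLAB releases, datetime(dateStr, 'InputFormat', 'yyyy-MM-dd') may be a more convenient alternative, depending on what you do with the dates afterwards.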

Answer 2

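Although in the long run it would be better to fix the data acquisition to avoid this situation, if that is possible, you can take advantage of HeaderLines in textscan.

This example code will work, but preallocate c3 / c4 if possible (i.e. by estimating an upper bound on the size and trimming off the unused zeros afterwards). Basically, on the first call to textscan it skips the first two lines and continues until it hits a line that is incompatible with the format (e.g. in the middle of a repeated header) or until it reaches the end of the file. It remembers its position, though.

On the next call to textscan, it skips the rest of that line and the whole next line, then continues (until eof or another set of header lines, and so on). If you have already reached the end of the file, textscan still runs without error, but length(data{3}) should be zero.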

c3 = [];
c4 = [];
fid = fopen('data.txt');
data = textscan(fid,'%s %s %f %f','HeaderLines',2);
l = length(data{3});
while l>0  %stop when we hit eof
  c3 = [c3; data{3}];
  c4 = [c4; data{4}];
  data = textscan(fid,'%s %s %f %f','HeaderLines',2);
  l = length(data{3});
end
fclose(fid);  %release the file handle once everything has been read
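The preallocation suggested above could look roughly like the following sketch. It is only an illustration under stated assumptions: maxRows is a guessed upper bound, the small sample file is written to a temporary location just so the snippet runs on its own, and the names fname, rows, n and k are invented for the example. The reading loop relies on the textscan behaviour described in this answer.

```matlab
% Write a small sample file so the sketch is self-contained (hypothetical data).
fname = [tempname '.txt'];
rows = {'Name Date Val1 val2', '--- ------- ---- ----', ...
        'BA 2013-09-07 123.123 1232.22', 'BA 2013-09-08 435.65756 2314.34', ...
        'Name Date Val1 val2', '--- ------- ---- ----', ...
        'BA 2013-09-09 234.2342 21342.342'};
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', rows{:});
fclose(fid);

maxRows = 1000;              % assumed upper bound on the number of data rows
c3 = zeros(maxRows, 1);      % preallocated output columns
c4 = zeros(maxRows, 1);
n = 0;                       % number of rows filled so far

fid = fopen(fname);
data = textscan(fid, '%s %s %f %f', 'HeaderLines', 2);
k = min(numel(data{3}), numel(data{4}));   % complete numeric rows in this chunk
while k > 0                  % stop when a call returns no numeric rows
    if n + k > maxRows       % the guess was too small: enlarge the buffers
        extra = max(maxRows, k);
        c3 = [c3; zeros(extra, 1)];
        c4 = [c4; zeros(extra, 1)];
        maxRows = maxRows + extra;
    end
    c3(n+1:n+k) = data{3}(1:k);
    c4(n+1:n+k) = data{4}(1:k);
    n = n + k;
    data = textscan(fid, '%s %s %f %f', 'HeaderLines', 2);
    k = min(numel(data{3}), numel(data{4}));
end
fclose(fid);
delete(fname);               % remove the temporary sample file

c3 = c3(1:n);                % trim the unused preallocated zeros
c4 = c4(1:n);
```

Writing into index ranges of a preallocated vector avoids reallocating c3 and c4 on every pass through the loop, which is where the repeated [c3; data{3}] concatenation would spend time on a file with many header blocks.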

Removing repeated lines from a text file with improved performance
