Question

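For non-MATLAB readers: I'm not sure which family they belong to, but MATLAB regular expressions are described in detail in the MATLAB documentation. MATLAB's comment character is % (percent) and its string delimiter is ' (apostrophe). A string delimiter inside a string is written as a doubled apostrophe ('this is how you write "it''s" in a string.'). To complicate matters further, the matrix transpose operators are also apostrophes (A' (Hermitian) or A.' (regular)).

Now, for dark reasons (which I will not elaborate on :), I am trying to interpret MATLAB code in MATLAB's own language.

Currently I am trying to remove all trailing comments from a cell array of strings, each containing one line of MATLAB code. At first glance, this looks simple: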

>> str = 'simpleCommand(); % simple trailing comment';
>> regexprep(str, '%.*$', '')
ans =
    simpleCommand();

But of course, something like this may come along:

>> str = ' fprintf(''%d%*c%3.0f\n'', value, args{:}); % Let''s do this! ';
>> regexprep(str, '%.*$', '') 
ans = 
    fprintf('        %//   <-- WRONG!

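Obviously, we need to exclude from the match all comment characters that reside inside strings, while also taking into account that a single apostrophe (or dot-apostrophe) directly following a statement is an operator, not a string delimiter.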
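For illustration, the rule in the previous paragraph can be sketched without regular expressions as a single left-to-right scan that tracks whether it is inside a string. This is a minimal sketch under stated assumptions, not a full MATLAB lexer: the function name stripTrailingComment is hypothetical, and the scan assumes single-quoted strings only, with no double-quoted strings, block comments, or `...` continuation lines.

```matlab
function out = stripTrailingComment(txt)
% Hypothetical helper: drop a trailing % comment from one line of code.
% Assumes single-quoted strings only (see the assumptions in the text).
inString = false;   % true while the scan is inside a '...' string
out = txt;          % default: no comment found, return the line unchanged
k = 1;
n = numel(txt);
while k <= n
    c = txt(k);
    if inString
        if c == ''''
            if k < n && txt(k+1) == ''''
                k = k + 1;          % doubled apostrophe: escaped quote
            else
                inString = false;   % closing string delimiter
            end
        end
    elseif c == '%'
        out = txt(1:k-1);           % first % outside a string starts the comment
        return
    elseif c == ''''
        % An apostrophe directly after a word character, a dot, a closing
        % bracket, or another apostrophe is treated as a transpose operator.
        isTranspose = k > 1 && ...
            ~isempty(regexp(txt(k-1), '[\w.\])}'']', 'once'));
        inString = ~isTranspose;
    end
    k = k + 1;
end
end
```

For a cell array of lines, it could be applied with cellfun(@stripTrailingComment, str, 'UniformOutput', false).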

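Assuming that the number of string opening/closing characters before the comment character must be even (which I know is incomplete, because of the matrix transpose operator), I came up with the following dynamic regular expression to handle this case: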

>> str = {
       'myFun( {''test'' ''%''}); % let''s '
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % let''s '
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); '
       'A = A.'';%tight trailing comment'
   };
>>
>> C = regexprep(str, '(^.*)(?@mod(sum(\1==''''''''),2)==0;)(%.*$)', '$1')

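However,

C =
    'myFun( {'test' '%'}); '              %// success
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '  %// success
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '  %// success
    'sprintf(str, '%*8.0f%*s%c'           %// FAIL
    'A = A.';'                            %// success (although I'm not sure why)

So I'm almost there, but not quite yet :)

Unfortunately, I've already spent a lot of time thinking about this and need to move on to other things, so perhaps someone else with more time is kind enough to think about these questions:

Are comment characters inside strings the only exception I need to watch out for?

What is the correct and/or more efficient way to do this?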

Answer 1

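How do you feel about using undocumented features? If you don't object, you can use the mtree function to parse the code and strip the comments. No regular expressions are involved, and we all know we shouldn't try to parse context-free grammars with regular expressions.

This function is a full parser for MATLAB code, written in pure M-code. As far as I can tell it is an experimental implementation, but MathWorks already uses it in a few places (it is the same function MATLAB Cody and Contests use to measure code length), and it can be used for other useful things.

If the input is a cell array of strings, we do: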

>> str = {..};
>> C = deblank(cellfun(@(s) tree2str(mtree(s)), str, 'UniformOutput',false))
C = 
    'myFun( { 'test', '%' } );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'A = A.';'

If you already have an M-file stored on disk, you can strip the comments simply with:

s = tree2str(mtree('myfile.m', '-file'))

If you want to keep the comments, add: mtree(.., '-comments')

Answer 2

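This handles the conjugate transpose case by checking which characters are allowed directly before a transpose apostrophe:

Digits: 2'
Letters: A'
Dot: A.'
Closing parenthesis, brace, and bracket: A(1)', A{1}' and [1 2 3]'

These are the only cases I can think of right now.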

C = regexprep(str, '^(([^'']*''[^'']*''|[^'']*[\.a-zA-Z0-9\)\}\]]''[^'']*)*[^'']*)%.*$', '$1')

On your example this returns

>> C = regexprep(str, '^(([^'']*''[^'']*''|[^'']*[\.a-zA-Z0-9\)\}\]]''[^'']*)*[^'']*)%.*$', '$1')

C = 

    'myFun( {'test' '%'}); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n');  '
    'A = A.';'

Answer 3

Look what I found! :)

The comment stripping toolbox, by Peter J. Acklam.

For m-code, it contains the following regular expression:

mainregex = [ ...
     ' (                   ' ... % Grouping parenthesis (content goes to $1).
     '   ( ^ | \n )        ' ... % Beginning of string or beginning of line.
     '   (                 ' ... % Non-capturing grouping parenthesis.
     '                     ' ...
     '' ... % Match anything that is neither a comment nor a string...
     '       (             ' ... % Non-capturing grouping parenthesis.
     '           [\]\)}\w.]' ... % Either a character followed by
     '           ''+       ' ... %    one or more transpose operators
     '         |           ' ... % or else
     '           [^''%]    ' ... %   any character except single quote (which
     '                     ' ... %   starts a string) or a percent sign (which
     '                     ' ... %   starts a comment).
     '       )+            ' ... % Match one or more times.
     '                     ' ...
     '' ...  % ...or...
     '     |               ' ...
     '                     ' ...
     '' ...  % ...match a string.
     '       ''            ' ... % Opening single quote that starts the string.
     '         [^''\n]*    ' ... % Zero or more chars that are neither single
     '                     ' ... %   quotes (special) nor newlines (illegal).
     '         (           ' ... % Non-capturing grouping parenthesis.
     '           ''''      ' ... % An embedded (literal) single quote character.
     '           [^''\n]*  ' ... % Again, zero or more chars that are neither
     '                     ' ... %   single quotes nor newlines.
     '         )*          ' ... % Match zero or more times.
     '       ''            ' ... % Closing single quote that ends the string.
     '                     ' ...
     '   )*                ' ... % Match zero or more times.
     ' )                   ' ...
     ' [^\n]*              ' ... % What remains must be a comment.
              ];

  % Remove all the blanks from the regex.
  mainregex = mainregex(~isspace(mainregex));

which becomes

mainregex  = '((^|\n)(([\]\)}\w.]''+|[^''%])+|''[^''\n]*(''''[^''\n]*)*'')*)[^\n]*'

and should be used as

C = regexprep(str, mainregex, '$1')

So far it has withstood all my tests, so I think this should solve my problem nicely :)
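As a quick check, the compact pattern can be run against some of the test lines from the question. This is only a sketch: the choice of test lines is mine, and it assumes the default regexprep options, where ^ anchors at the start of each string.

```matlab
% Compact form of the pattern quoted above.
mainregex = '((^|\n)(([\]\)}\w.]''+|[^''%])+|''[^''\n]*(''''[^''\n]*)*'')*)[^\n]*';

% Test lines taken from the question: a % inside a string, an apostrophe
% inside a comment, and a transpose directly before a tight comment.
str = {
    'myFun( {''test'' ''%''}); % let''s '
    'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '
    'A = A.'';%tight trailing comment'
    };

% Keep only token 1 (the code part); the trailing [^\n]* is dropped.
C = regexprep(str, mainregex, '$1');

% Show each result between brackets so trailing blanks stay visible.
for k = 1:numel(C)
    fprintf('[%s]\n', C{k});
end
```

Note that the blank before each removed comment stays in place, so deblank may be worth applying afterwards if trailing whitespace matters.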

Answer 4

I prefer to abuse checkcode (which replaces the old mlint) to do the parsing. Here is a suggestion:

function strNC = removeComments(str)
if iscell(str)
    strNC = cellfun(@removeComments, str, 'UniformOutput', false);
elseif regexp(str, '%', 'once')
    err = getCheckCodeId(str);
    strNC = regexprep(str, '%[^%]*$', '');
    errNC = getCheckCodeId(strNC);
    if strcmp(err, errNC),
        strNC = removeComments(strNC);
    else
        strNC = str;
    end
else
    strNC = str;
end
end

function errid = getCheckCodeId(line)
fName = 'someTempFileName.m';
fh = fopen(fName, 'w');
fprintf(fh, '%s\n', line);
fclose(fh);
if exist('checkcode')
    structRep = checkcode(fName, '-id');
else
    structRep = mlint(fName, '-id');
end
delete(fName);
if isempty(structRep)
    errid = '';
else
    errid = structRep.id;
end
end

For each line, it checks whether we introduce an error by trimming the line from the last % to the end of the line.

For your example, it returns:

>> removeComments(str)

ans = 

    'myFun( {'test' '%'}); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n');  '
    'A = A.';'

It does not remove the suppression directive %#ok, so you get:

>> removeComments('a=1; %#ok')

ans =

a=1; %#ok

That may be a good thing.

Answer 5

How about making sure that all apostrophes before the comment come in pairs:

>> str = {
       'myFun( {''test'' ''%''}); % let''s '                 
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '        
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % let''s '       
       'sprintf(str, ''%*8.0f%*s%c%3d\n'');  '
   };

>> C = regexprep(str, '^(([^'']*''[^'']*'')*[^'']*)%.*$', '$1')

C = 
    myFun( {'test' '%'}); 
    sprintf(str, '%*8.0f%*s%c%3d\n'); 
    sprintf(str, '%*8.0f%*s%c%3d\n'); 
    sprintf(str, '%*8.0f%*s%c%3d\n');
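One limitation is worth spelling out: the test set above leaves out the transpose line from the question. A small sketch of what happens when it is included (the variable names pairRegex, codeLine, and result are mine):

```matlab
% Pairing pattern from this answer: every apostrophe before the % must
% belong to a complete '...' pair.
pairRegex = '^(([^'']*''[^'']*'')*[^'']*)%.*$';

% Transpose example from the question: a single, unpaired apostrophe
% precedes the comment character.
codeLine = 'A = A.'';%tight trailing comment';

result = regexprep(codeLine, pairRegex, '$1');

% With only one apostrophe before the %, the anchored pattern cannot
% match, so regexprep returns the input unchanged and the comment stays.
disp(isequal(result, codeLine))   % prints 1
```

So this pattern is a good fit for lines that contain strings but no transpose operators; for lines with transposes, one of the transpose-aware patterns from the other answers is needed.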

How can I remove trailing comments via regexp?
