Question

I am writing a Python program that uses regular expressions to search for comments in a C++ program. I wrote the following code:

import re
regex = re.compile(r'(\/\/(.*?))\n|(\/\*(.|\n)*\*\/)')
comments = []
text = ""
while True:
    try:
        x= raw_input()
        text = text + "\n"+ x
    except EOFError:
        break
z = regex.finditer(text)
for match in z:
    print match.group(1)

This code is supposed to detect comments of the form //I'm comment and /*blah blah blah blah blah*/, but I got the following output:

// my  program in C++
None
//use cout

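This is not what I expected. My thinking was that match.group(1) should capture the first parenthesis of (\/\*(.|\n)*\*\/), but it does not. The C++ program I am testing with is: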

// my  program in C++

#include <iostream>
/** I love c++
    This is awesome **/
using namespace std;

int main ()
{
  cout << "Hello World"; //use cout
  return 0;
}

Answer 1

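You are not using the right order, because an inline comment can be contained inside a multiline comment. So you need to start your pattern with the multiline comment. For example: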

/\*[\s\S]*?\*/|//.*
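
As a rough illustration, the pattern above can be tried from Python 3 on the sample program from the question. The names COMMENT_RE, find_comments, and SAMPLE are made up for this sketch and are not part of the original answer.

```python
import re

# The pattern from this answer: block comments first, then line comments.
COMMENT_RE = re.compile(r'/\*[\s\S]*?\*/|//.*')

# The C++ program from the question.
SAMPLE = '''// my  program in C++

#include <iostream>
/** I love c++
    This is awesome **/
using namespace std;

int main ()
{
  cout << "Hello World"; //use cout
  return 0;
}
'''

def find_comments(source):
    # The pattern has no capture groups, so findall returns whole matches.
    return COMMENT_RE.findall(source)

for comment in find_comments(SAMPLE):
    print(repr(comment))
```

Because the pattern contains no capture groups, there is no need to pick a group number here: each result is the full text of one comment.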

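Note that you can improve this pattern if you have long multiline comments (this syntax emulates the atomic group feature, which the re module did not support before Python 3.11):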

/\*(?:(?=([^*]+|\*(?!/))\1)*\*/|//.*

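But note that there are other traps, such as strings that contain /*...*/ or //.....

So if you want to avoid these cases, for example when performing a replacement, you need to capture strings first and use a backreference in the replacement string, like this: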

(pattern for strings)|/\*[\s\S]*?\*/|//.*

Replacement: $1
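
One possible way to apply this idea in Python 3 is sketched here. The string sub-pattern is an assumption (the answer leaves "pattern for strings" unspecified), and STRING, STRIP_RE, and strip_comments are hypothetical names. A replacement function is used in place of the $1 backreference so that an unmatched group 1 simply becomes an empty string.

```python
import re

# Assumed sub-pattern for C-style string and character literals;
# the original answer does not spell this part out.
STRING = r'"(?:\\[\s\S]|[^"\\])*"' + r"|'(?:\\[\s\S]|[^'\\])*'"

# Strings are tried first and captured in group 1; comments are not captured.
STRIP_RE = re.compile(r'(' + STRING + r')|/\*[\s\S]*?\*/|//.*')

def strip_comments(source):
    # Keep string literals (group 1) and drop everything else that matched.
    return STRIP_RE.sub(lambda m: m.group(1) or '', source)

print(strip_comments('s = "// not a comment"; // real\n'))
```

Matching the string literals first means that comment-like text inside quotes is consumed as part of the string and is put back unchanged.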

Answer 2

Use group(0). With the contents of the file 'txt' being your example:

import re
regex = re.compile(r'(\/\/(.*?))\n|(\/\*(.|\n)*\*\/)')
comments = []
text = ""
for line in open('txt').readlines():
    text = text + line
z = regex.finditer(text)
for match in z:
    print match.group(0).replace("\n","")

My output is:

// my  program in C++
/** I love c++        This is awesome **/
//use cout

To help people understand:

import re
regex = re.compile(r'((\/\/(.*?))\n|(\/\*(.|\n)*\*\/))')
comments = []
text = ""
for line in open('txt').readlines():
    text = text + line
z = regex.finditer(text)
for match in z:
    print match.group(1)

This outputs:

// my  program in C++

/** I love c++
    This is awesome **/
//use cout

Answer 3

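Unfortunately, you have to parse quotes and non-comments at the same time, because parts of the comment syntax can be embedded in them.

This is an old-style Perl regex. The match of interest is capture group 1, which contains the comment. So use a global-search while loop and check whether group 1 matched.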

    # (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)


    (                                # (1 start), Comments 
         /\*                              # Start /* .. */ comment
         [^*]* \*+
         (?: [^/*] [^*]* \*+ )*
         /                                # End /* .. */ comment
      |  
         //                               # Start // comment
         (?: [^\\] | \\ \n? )*?           # Possible line-continuation
         \n                               # End // comment
    )                                # (1 end)
 |  
    (                                # (2 start), Non - comments 
         "
         (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
         "
      |  '
         (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
         ' 
      |  [\S\s]                           # Any other char
         [^/"'\\]*                        # Chars which doesn't start a comment, string, escape,
                                          # or line continuation (escape + newline)
    )                                # (2 end)
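
The regex above can be carried over to Python 3 roughly as follows. This is a sketch, not the answerer's code: COMMENTS, NON_COMMENTS, TOKEN_RE, and extract_comments are assumed names, and finditer stands in for the Perl global-search while loop.

```python
import re

# Group 1: comments (copied from the raw regex above).
COMMENTS = r'(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)'
# Group 2: quoted text and any other non-comment characters.
NON_COMMENTS = r'''("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)'''

TOKEN_RE = re.compile(COMMENTS + '|' + NON_COMMENTS)

def extract_comments(source):
    # Walk every match and keep only those where group 1 participated.
    return [m.group(1) for m in TOKEN_RE.finditer(source)
            if m.group(1) is not None]

src = 'const char *s = "/* not a comment */"; // real one\nint x; /* block */\n'
print(extract_comments(src))
```

Note that, as the pattern is written, a // comment is matched together with its terminating newline, so a // comment on the last line of a file with no trailing newline would not be reported in group 1.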

Answer 4

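Adding another answer.

(Note: the problem you are having has nothing to do with changing the order of the comment subexpressions.)

Yours is a simplified version of a regex for getting C++ comments. If you don't want the full version, we can look at why you are having problems.

First, your regex is almost correct. There is one problem, in the subexpression for /* ... */ comments: its contents must be made non-greedy.

Other than that, it works the way it should. But you should take a closer look at the capture groups. In your code, you print only group 1 on each match, which is the // ... comment. You could check both group 1 and group 3 for a match, or just print group 0 (the whole match).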
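
To make the group numbering concrete, here is a short Python 3 sketch that runs the original regex on the question's sample and shows groups 1 and 3 side by side. The names ORIGINAL, SAMPLE, and groups are assumptions made for this illustration.

```python
import re

# The regex exactly as written in the question.
ORIGINAL = re.compile(r'(\/\/(.*?))\n|(\/\*(.|\n)*\*\/)')

SAMPLE = '''// my  program in C++

#include <iostream>
/** I love c++
    This is awesome **/
using namespace std;

int main ()
{
  cout << "Hello World"; //use cout
  return 0;
}
'''

# For each match, record group 1 (the // comment) and group 3 (the /* */ comment).
groups = [(m.group(1), m.group(3)) for m in ORIGINAL.finditer(SAMPLE)]
for g1, g3 in groups:
    print(repr(g1), repr(g3))
```

Only one side of the alternation takes part in any given match, so the block comment leaves group 1 as None, which is the None seen in the question's output.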

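Also, you don't need the lazy quantifier ? in group 2, and the newline \n below it should NOT be there. And consider making all the capture groups non-capturing, (?: .. ).

So, remove the ? quantifier and the \n from the // ... subexpression, and add a ? quantifier in the /* ... */ subexpression.

Here is your original regex, formatted (using RegexFormat 5 with auto-comments):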

    # raw regex:   (//(.*?))\n|(/\*(.|\n)*\*/)

    (                    # (1 start)
         //
         ( .*? )              # (2)
    )                    # (1 end)
    \n 
 |  
    (                    # (3 start)
         /\*
         ( . | \n )*          # (4)
         \*/
    )                    # (3 end)

Here it is with no capture groups and the 2 minor quantifier changes:

    # raw regex:   //(?:.*)|/\*(?:.|\n)*?\*/

    //
    (?: .* )
 |  
    /\*
    (?: . | \n )*?
    \*/

Output

 **  Grp 0 -  ( pos 0 , len 21 ) 
// my  program in C++  

---------------------------

 **  Grp 0 -  ( pos 43 , len 38 ) 
/** I love c++
    This is awesome **/  

---------------------------

 **  Grp 0 -  ( pos 143 , len 10 ) 
//use cout
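
To see why the non-greedy quantifier matters, here is a small Python 3 sketch that compares the corrected regex with a greedy variant on input containing two block comments. The names FIXED, GREEDY, and code are illustrative only.

```python
import re

# Corrected regex from this answer: no capture groups, lazy block-comment body.
FIXED = re.compile(r'//(?:.*)|/\*(?:.|\n)*?\*/')
# Same regex with a greedy block-comment body, kept for comparison.
GREEDY = re.compile(r'//(?:.*)|/\*(?:.|\n)*\*/')

code = "/* one */ int a; /* two */\n"

print(FIXED.findall(code))   # each block comment separately
print(GREEDY.findall(code))  # one match running to the last */
```

With the greedy body, everything between the first /* and the last */ is swallowed, including the code between the two comments. The question's sample has only one block comment, which is why the problem did not show up there.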

Searching for basic comments in C++ with regex
