Question

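I'm writing a program to automate writing some C code (I'm writing code to parse strings into enums with the same name). C's handling of strings isn't that great, so some people have been nagging me to try Python.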

I made a function that is supposed to remove C-style /* COMMENT */ and //COMMENT from a string. Here is the code:

def removeComments(string):
    re.sub(re.compile("/\*.*?\*/",re.DOTALL ) ,"" ,string) # remove all occurance streamed comments (/*COMMENT */) from string
    re.sub(re.compile("//.*?\n" ) ,"" ,string) # remove all occurance singleline comments (//COMMENT\n ) from string

So I tried this code out.

str="/* spam * spam */ eggs"
removeComments(str)
print(str)

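And it apparently did nothing.

Any suggestions as to what I've done wrong?

There's a saying I've heard a couple of times:

If you have a problem and you try to solve it with regex, you end up with two problems.

EDIT: Looking back at this years later (after more parsing experience).

I think regex may have been the right solution, and the simple regex used here was "good enough". I may not have emphasized this enough in the question: this was for a single specific file that had no tricky cases. I think it would be a lot less maintenance to keep the file being parsed simple enough for the regex than to complicate the regex into an unreadable symbol soup.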

Answer 1

Many answers have already been given, but what about "//comment-like strings inside quotes"?

The OP is asking how to do this using regular expressions, so:

import re

def remove_comments(string):
    pattern = r"(\".*?\"|\'.*?\')|(/\*.*?\*/|//[^\r\n]*$)"
    # first group captures quoted strings (double or single)
    # second group captures comments (//single-line or /* multi-line */)
    regex = re.compile(pattern, re.MULTILINE|re.DOTALL)
    def _replacer(match):
        # if the 2nd group (capturing comments) is not None,
        # it means we have captured a non-quoted (real) comment string.
        if match.group(2) is not None:
            return "" # so we will return empty to remove the comment
        else: # otherwise, we will return the 1st group
            return match.group(1) # captured quoted-string
    return regex.sub(_replacer, string)

This WILL remove:

/* multi-line comments */
// single-line comments

This WILL NOT remove:

String var1 = "this is /* not a comment. */";
char *var2 = "this is // not a comment, either.";
url = 'http://not.comment.com';

NOTE: This also works for JavaScript source.
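
To make the two lists above concrete, here is a small self-contained run. The function is repeated (in condensed form) so the snippet works on its own, and the `source` sample is a made-up input for illustration, not something from the original answer.

```python
import re

def remove_comments(string):
    # Same pattern as in the answer: group 1 = quoted strings, group 2 = comments.
    pattern = r"(\".*?\"|\'.*?\')|(/\*.*?\*/|//[^\r\n]*$)"
    regex = re.compile(pattern, re.MULTILINE | re.DOTALL)

    def _replacer(match):
        # Drop real comments; put quoted strings back unchanged.
        return "" if match.group(2) is not None else match.group(1)

    return regex.sub(_replacer, string)

# Hypothetical sample input, made up for this illustration.
source = (
    'char *var2 = "this is // not a comment, either."; // real comment\n'
    "int x = 1; /* multi-line\n"
    "comment */ int y = 2;\n"
)

print(remove_comments(source))
```

The quoted string survives while both real comments disappear. The sketch inherits the limits of the pattern: for example, an escaped quote inside a string literal ends the quoted-string match early, so text after it can still be mistaken for a comment.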

Answer 2

re.sub returns a string, so changing your code to the following will give results:

import re

def removeComments(string):
    string = re.sub(re.compile(r"/\*.*?\*/", re.DOTALL), "", string)  # remove all streamed comments (/* COMMENT */) from string
    string = re.sub(re.compile(r"//.*?\n"), "", string)  # remove all single-line comments (//COMMENT\n) from string
    return string

Answer 3

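I would suggest using a REAL parser like SimpleParse or PyParsing. SimpleParse requires that you actually know EBNF, but is very fast. PyParsing has its own EBNF-like syntax, but it is adapted for Python and makes it easy to build powerful parsers.

Edit:

Here is a simple example of how to use PyParsing in this context: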

>>> test = '/* spam * spam */ eggs'
>>> import pyparsing
>>> comment = pyparsing.nestedExpr("/*", "*/").suppress()
>>> comment.transformString(test)
' eggs'

Here is a more complex example using single-line and multi-line comments.

Before:

/*
 * multiline comments
 * abc 2323jklj
 * this is the worst C code ever!!
*/
void
do_stuff ( int shoe, short foot ) {
    /* this is a comment
     * multiline again! 
     */
    exciting_function(whee);
} /* extraneous comment */

After:

>>> print(comment.transformString(code))

void
do_stuff ( int shoe, short foot ) {

     exciting_function(whee);
}

It leaves an extra newline wherever it stripped a comment, but that could be addressed.
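
One way the leftover blank lines could be addressed is a second pass with `re`. This is a sketch of my own, not a pyparsing feature: `collapse_blank_lines` is a hypothetical helper name, and it assumes that squeezing runs of blank lines is acceptable even where the original code had several blank lines on purpose.

```python
import re

def collapse_blank_lines(text):
    """Strip trailing spaces/tabs, then collapse 3+ consecutive newlines into 2."""
    # Remove spaces and tabs left at the end of lines by stripped comments.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Three or more newlines in a row mean two or more blank lines; keep one.
    return re.sub(r"\n{3,}", "\n\n", text)

# Hypothetical input resembling the comment-stripped output above.
stripped = (
    "void\n"
    "do_stuff ( int shoe, short foot ) {\n"
    "    \n"
    "\n"
    "    exciting_function(whee);\n"
    "} \n"
)

print(collapse_blank_lines(stripped))
```

Removing blank lines entirely would be a different choice; the sketch keeps one so that paragraphs of code stay visually separated.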

Answer 4

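I would suggest that you read this page, which has a quite detailed analysis of the problem and gives a good understanding of why your approach doesn't work: http://ostermiller.org/findcomment.html

Short version: the regex you are looking for is this: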

(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)

This should match both types of comment blocks. If you are having trouble following it, read the page I linked.
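
As a rough sketch of how that pattern might be used from Python: the names `COMMENT_RE` and `strip_comments` below are hypothetical, chosen for this illustration. The pattern is compiled without `re.DOTALL` on the assumption that `//.*` should stop at the end of the line.

```python
import re

# The pattern quoted above, compiled as-is. Without re.DOTALL, "." does not
# match a newline, so "//.*" only runs to the end of the current line.
COMMENT_RE = re.compile(r"(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)")

def strip_comments(text):
    """Delete every match of COMMENT_RE from text."""
    return COMMENT_RE.sub("", text)

print(repr(strip_comments("/* spam * spam */ eggs")))
print(repr(strip_comments("int a; // x\nint b; /* y\n z */ int c;")))
```

Like the simpler patterns in the question, this one knows nothing about string literals, so a `//` inside a quoted string would still be treated as the start of a comment.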

Answer 5

Found another solution with pyparsing, following Jathanism.

import pyparsing

test = """
/* Code my code
xx to remove comments in C++
or C or python */

include <iostream> // Some comment

int main (){
    cout << "hello world" << std::endl; // comment
}
"""
commentFilter = pyparsing.cppStyleComment.suppress()
# To filter python style comment, use
# commentFilter = pyparsing.pythonStyleComment.suppress()
# To filter C style comment, use
# commentFilter = pyparsing.cStyleComment.suppress()

newtest = commentFilter.transformString(test)
print(newtest)

This produces the following output:

include <iostream> 

int main (){
    cout << "hello world" << std::endl; 
}

You can also use pythonStyleComment, javaStyleComment, or cppStyleComment. I found it pretty useful.

Answer 6

You're doing it wrong.

Regex is for regular languages, which C isn't.

Answer 7

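I see a few things you might want to revise.

First, Python passes arguments by object reference, and some object types are immutable. Strings and integers are among these immutable types. So if you pass a string to a function, nothing the function does can change the string you passed in. You should return the string instead. Furthermore, within the removeComments() function, you need to assign the value returned by re.sub() to a variable; like any function that takes a string as an argument, re.sub() doesn't modify the string.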
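
A minimal sketch of the difference, using only the block-comment pattern from the question. Both function names are hypothetical and exist only for this illustration.

```python
import re

def strip_block_comments_broken(text):
    # re.sub builds a NEW string; here the result is thrown away,
    # so the function implicitly returns None.
    re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)

def strip_block_comments(text):
    # Return the new string so the caller can use it.
    return re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)

code = "/* spam * spam */ eggs"

strip_block_comments_broken(code)
print(repr(code))      # the caller's string is unchanged

code = strip_block_comments(code)
print(repr(code))      # rebinding the name to the returned string
```

The caller has to rebind its own name to the returned value; nothing inside the function can do that on its behalf.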

Second, I'll echo what others have said about parsing C code. Regular expressions are not the best way to go here.

Answer 8

mystring="""
blah1 /* comments with
multiline */

blah2
blah3
// double slashes comments
blah4 // some junk comments

"""
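for s in mystring.split("*/"):
    # Keep the text before "/*". find() returns -1 when "/*" is absent,
    # in which case s[:-1] just drops the last character of the chunk.
    s = s[:s.find("/*")]
    # Keep the text before the first "//"; the rest of the chunk is dropped.
    print(s[:s.find("//")])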

Output

$ ./python.py

blah1


blah2
blah3

Answer 9

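As I noted in one of my other comments, comment nesting isn't really the problem (in C, comments don't nest, though a few compilers do support nested comments anyway). The problem is with things like string literals, which can contain the exact same character sequence as a comment delimiter without actually being one.

As Mike Graham said, the right tool for the job is a lexer. A parser is unnecessary and would be overkill, but a lexer is exactly the right thing. As it happens, I posted a (partial) lexer for C (and C++) earlier this morning. It doesn't attempt to correctly identify all lexical elements (i.e. all keywords and operators), but it's entirely sufficient for stripping comments. It won't do any good on the "using Python" front though, as it's written entirely in C (it predates my using C++ for much more than experimental code).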
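
To illustrate the lexer idea in Python, here is a minimal hand-written scanner. It is a sketch of my own, not the C lexer mentioned above, and `strip_c_comments` is a hypothetical name. It assumes only three kinds of tokens matter (quoted literals, `//` comments, `/* */` comments) and ignores corner cases such as backslash-newline continuations inside `//` comments.

```python
def strip_c_comments(source):
    """Remove // and /* */ comments, leaving string and char literals alone."""
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch in "\"'":
            # String or character literal: copy through to the closing quote,
            # stepping over backslash escapes such as \" or \\.
            j = i + 1
            while j < n and source[j] != ch:
                j += 2 if source[j] == "\\" else 1
            j = min(j + 1, n)
            out.append(source[i:j])
            i = j
        elif source.startswith("//", i):
            # Line comment: skip up to the newline, which is kept.
            j = source.find("\n", i)
            i = n if j == -1 else j
        elif source.startswith("/*", i):
            # Block comment: skip past "*/" and emit one space so that the
            # tokens on either side stay separated.
            j = source.find("*/", i + 2)
            i = n if j == -1 else j + 2
            out.append(" ")
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(strip_c_comments('char *s = "/* not a comment */"; /* real */ // tail\n'))
```

Because the scanner consumes a whole literal before it ever looks for comment delimiters, a `//` or `/*` inside quotes is never examined as a comment start, which is the case the regex-only approaches struggle with.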

Answer 10

This program removes // and /* */ comments from the given file:

#!/usr/bin/python3
import sys
import re

if len(sys.argv) != 2:
    sys.exit("Syntax:python3 exe18.py inputfile.cc ")
else:
    print('The following files are given by you:', sys.argv[0], sys.argv[1])

# First pass: replace /* ... */ comments with a space and write the file back.
with open(sys.argv[1], 'r') as ifile:
    newstring = re.sub(r'/\*.*?\*/', ' ', ifile.read(), flags=re.S)
with open(sys.argv[1], 'w') as ifile:
    ifile.write(newstring)
print('/* */ have been removed from the inputfile')

# Second pass: replace // comments with a space and write the file back.
with open(sys.argv[1], 'r') as ifile:
    newstring1 = re.sub(r'//.*', ' ', ifile.read())
with open(sys.argv[1], 'w') as ifile:
    ifile.write(newstring1)
print('// have been removed from the inputfile')

Answer 11

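Just want to add another regex, for the case where we have to remove anything between * and ; in Python:

data = re.sub(re.compile(r"\*.*?\;", re.DOTALL), '', data)

There is a backslash before * to escape the metacharacter.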

Using regex to remove comments from source files
