Question

I am trying to find a good regex for Python comments located within a long string. So far I have

regex:

#(.?|\n)*

string:

'### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something()\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'

I feel there is a better way to get all of the individual comments from the string, but I'm no regex expert. Does anyone have a better solution?

Answer 1

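A regex will work fine if you do two things:

1.  Remove all string literals (since they can contain # characters).
2.  Capture everything that starts with a # character and continues to the end of the line.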

Here is a demonstration:

>>> from re import findall, sub
>>> string = '### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something()\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'
>>> findall("#.*", sub('(?s)\'.*?\'|".*?"', '', string))
['### this is a comment', '# this call outputs an xml stream of the current parameter dictionary.', '# wow another comment']
>>>

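re.sub removes everything of the form "..." or '...'. This saves you from having to worry about # characters that sit inside string literals.

(?s) sets the dot-all flag, which allows . to match newline characters, so a literal that spans several lines is removed as well.

Finally, re.findall gets everything that starts with a # character and continues to the end of the line.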
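The two steps can be sketched as separate helpers, which makes it easier to see what each regex contributes. The names STRING_LITERAL, COMMENT, strip_string_literals and find_comments, as well as the sample input, are made up for illustration; the patterns are the ones from the demonstration, with re.DOTALL standing in for the inline (?s) flag.

```python
import re

# Step 1 pattern: a '...' or "..." literal. DOTALL lets '.' cross newlines,
# so literals that span several lines are matched too.
STRING_LITERAL = re.compile(r"'.*?'" + "|" + r'".*?"', re.DOTALL)

# Step 2 pattern: a '#' and everything after it up to the end of the line.
# No DOTALL here, so '.' stops at the newline.
COMMENT = re.compile(r"#.*")


def strip_string_literals(source):
    """Return source with every quoted literal removed."""
    return STRING_LITERAL.sub("", source)


def find_comments(source):
    """Return the comments left once string literals are gone."""
    return COMMENT.findall(strip_string_literals(source))


# Hypothetical input: one real comment and two '#' characters inside strings.
sample = 'a = "not # a comment"  # real comment\nb = \'#\'\n'

print(repr(strip_string_literals(sample)))
print(find_comments(sample))
```

Printing the intermediate result of step 1 is a handy way to check what the first regex actually removed before trusting the list of comments.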

For a more complete test, place this sample code in a file named test.py:

# Comment 1
for i in range(10): # Comment 2
    print('#foo')
    print("abc#bar")
    print("""
#hello
abcde#foo
""")  # Comment 3
    print('''#foo
    #foo''')  # Comment 4

The solution given above still works on this file:

>>> from re import findall, sub
>>> string = open('test.py').read()
>>> findall("#.*", sub('(?s)\'.*?\'|".*?"', '', string))
['# Comment 1', '# Comment 2', '# Comment 3', '# Comment 4']
>>>
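One caveat worth checking before relying on this approach: the string-stripping pattern treats any quote character as the start of a literal, including an apostrophe inside a comment. The sketch below assumes Python 3 and uses a hypothetical helper, find_comments_regex, that wraps the same two calls. It shows two comments being merged and mangled when each contains an apostrophe.

```python
import re


def find_comments_regex(source):
    """Same two steps as above: drop quoted literals, then grab '#' to end of line."""
    without_strings = re.sub(r"(?s)'.*?'|\".*?\"", "", source)
    return re.findall(r"#.*", without_strings)


# Hypothetical input: both comments contain an apostrophe.
tricky = "x = 1  # don't do this\ny = 2  # it's fine\n"

# The text between the two apostrophes is taken for a string literal and
# removed, so only one damaged "comment" survives.
result = find_comments_regex(tricky)
print(result)  # ['# dons fine']
```

If the input can contain comments like these, the regex approach needs extra care or a real tokenizer.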

Answer 2

Since this is Python code in a string, I'd use the tokenize module to parse it and extract the comments:

# Python 3; on Python 2, use StringIO.StringIO instead of io.StringIO
import io
import tokenize

text = '### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something():\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'

# generate_tokens() expects a readline-style callable that returns one line per call
tokens = tokenize.generate_tokens(io.StringIO(text).readline)
for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokens:
    if toktype == tokenize.COMMENT:
        print(ttext)

This prints:

### this is a comment
# this call outputs an xml stream of the current parameter dictionary.
# wow another comment

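Note that the code in the question's string has a syntax error: the : after the function definition do_something() is missing. It has been added in the text above.

Also note that the ast module doesn't work here, because it does not preserve comments.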
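As a rough sketch of both points, assuming Python 3: the hypothetical helper comments_with_positions below collects each comment together with its line and column, and the last line checks that the AST built from the same source carries no trace of the comment text. The helper name and the sample source are made up for illustration.

```python
import ast
import io
import tokenize


def comments_with_positions(source):
    """Return (line, column, text) for each comment token in source."""
    readline = io.StringIO(source).readline
    return [
        (tok.start[0], tok.start[1], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.COMMENT
    ]


# Hypothetical input: an apostrophe in a comment and a '#' inside a string.
source = "x = 1  # don't touch\nprint('#not a comment')  # real\n"

found = comments_with_positions(source)
print(found)  # [(1, 7, "# don't touch"), (2, 25, '# real')]

# The parsed tree keeps the string constant but not the comments.
print("touch" in ast.dump(ast.parse(source)))  # False
```

Because the tokenizer knows where string literals start and end, neither the apostrophe nor the '#' inside the string confuses it. Lines are 1-based and columns are 0-based.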

Answer 3

Get the comments from the matched group at index 1.

(#+[^\\\n]*)


Sample code:

import re
p = re.compile(r'(#+[^\\\n]*)')  # the ur'' prefix is a syntax error on Python 3
test_str = u"..."

re.findall(p, test_str)

Matches

1.  ### this is a comment
2.  # this call outputs an xml stream of the current parameter dictionary.
3.  # wow another comment
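To see how this pattern behaves on input other than the question's string, here is a small sketch assuming Python 3. The sample source is made up. Two things are worth noticing: the character class excludes backslashes as well as newlines, so a comment is cut at the first backslash, and nothing strips string literals first, so a # inside a string is reported too.

```python
import re

pattern = re.compile(r'(#+[^\\\n]*)')

# Hypothetical input: a plain comment, a comment containing a backslash,
# and a '#' inside a string literal.
source = "x = 1  # plain comment\ny = 2  # path C:\\temp\nprint('#foo')\n"

# With a single capturing group, findall returns the group's text per match.
matches = pattern.findall(source)
print(matches)  # ['# plain comment', '# path C:', "#foo')"]
```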

Python regex for comments in a long string
