Is there a Python script or tool available that can remove comments and docstrings from Python source?
It should take care of cases like:
"""
aas
"""
def f():
    m = {
        u'x':
            u'y'
        } # fake docstring ;)
    if 1:
        'string' >> m
    if 2:
        'string' , m
    if 3:
        'string' > m
So finally I came up with a simple script that uses the tokenize module and removes comment tokens. It seems to work pretty well, except that I am not able to remove docstrings in all cases. See if you can improve it to remove docstrings as well.
import cStringIO
import tokenize

def remove_comments(src):
    """
    This reads tokens using tokenize.generate_tokens and recombines them
    using tokenize.untokenize, skipping comment/docstring tokens in between
    """
    f = cStringIO.StringIO(src)
    class SkipException(Exception): pass
    processed_tokens = []
    last_token = None
    # go thru all the tokens and try to skip comments and docstrings
    for tok in tokenize.generate_tokens(f.readline):
        t_type, t_string, t_srow_scol, t_erow_ecol, t_line = tok
        try:
            if t_type == tokenize.COMMENT:
                raise SkipException()
            elif t_type == tokenize.STRING:
                if last_token is None or last_token[0] in [tokenize.INDENT]:
                    # FIXME: this may remove valid strings too?
                    #raise SkipException()
                    pass
        except SkipException:
            pass
        else:
            processed_tokens.append(tok)
        last_token = tok
    return tokenize.untokenize(processed_tokens)
Also, I would like to test it on a large collection of scripts with good unit test coverage. Can you suggest such an open source project?
Answer 0 (Score: 16)
I'm the author of "mygod, he wrote a python interpreter using regex..." (i.e. pyminifier) mentioned at that link below =). I just wanted to chime in and say that I've improved the code quite a bit using the tokenize module (which I discovered thanks to this question =). You'll be happy to note that the code no longer relies so much on regular expressions and uses tokenize to great effect. Anyway, here is the remove_comments_and_docstrings() function from pyminifier (note: it works properly with the edge cases on which the previously-posted code breaks):
import cStringIO, tokenize

def remove_comments_and_docstrings(source):
    """
    Returns 'source' minus comments and docstrings.
    """
    io_obj = cStringIO.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        # The following two conditionals preserve indentation.
        # This is necessary because we're not using tokenize.untokenize()
        # (because it spits out code with copious amounts of oddly-placed
        # whitespace).
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        # Remove comments:
        if token_type == tokenize.COMMENT:
            pass
        # This series of conditionals removes docstrings:
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                # This is likely a docstring; double-check we're not inside an operator:
                if prev_toktype != tokenize.NEWLINE:
                    # Note regarding NEWLINE vs NL: The tokenize module
                    # differentiates between newlines that end a statement
                    # and newlines inside of operators such as parens, brackets,
                    # and curly braces. Newlines that end a statement are
                    # NEWLINE and newlines inside of operators are NL.
                    # Catch whole-module docstrings:
                    if start_col > 0:
                        # Unlabelled indentation means we're inside an operator
                        out += token_string
                    # Note regarding the INDENT token: The tokenize module does
                    # not label indentation inside of an operator (parens,
                    # brackets, and curly braces) as actual indentation.
                    # For example:
                    # def foo():
                    #     "The spaces before this docstring are tokenize.INDENT"
                    #     test = [
                    #         "The spaces before this string do not get a token"
                    #     ]
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    return out
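A minimal usage sketch (my addition, not part of the answer; it assumes the function above is in scope and that test.py is the file you want to strip):

# Hypothetical usage of remove_comments_and_docstrings() above;
# 'test.py' stands in for whatever file you want to strip.
with open('test.py') as f:
    print(remove_comments_and_docstrings(f.read()))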
Answer 1 (Score: 8)
This does the job:
""" Strip comments and docstrings from a file.
"""
import sys, token, tokenize
def do_file(fname):
""" Run on just one file.
"""
source = open(fname)
mod = open(fname + ",strip", "w")
prev_toktype = token.INDENT
first_line = None
last_lineno = -1
last_col = 0
tokgen = tokenize.generate_tokens(source.readline)
for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokgen:
if 0: # Change to if 1 to see the tokens fly by.
print("%10s %-14s %-20r %r" % (
tokenize.tok_name.get(toktype, toktype),
"%d.%d-%d.%d" % (slineno, scol, elineno, ecol),
ttext, ltext
))
if slineno > last_lineno:
last_col = 0
if scol > last_col:
mod.write(" " * (scol - last_col))
if toktype == token.STRING and prev_toktype == token.INDENT:
# Docstring
mod.write("#--")
elif toktype == tokenize.COMMENT:
# Comment
mod.write("##\n")
else:
mod.write(ttext)
prev_toktype = toktype
last_col = ecol
last_lineno = elineno
if __name__ == '__main__':
do_file(sys.argv[1])
I'm leaving stub comments in the place of docstrings and comments since it simplifies the code. If you remove them completely, you also have to get rid of the indentation before them.
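A hypothetical invocation (my addition; example.py stands in for any file you want to strip, and the ",strip" suffix is the script's own output convention):

# Writes the stripped copy to "example.py,strip", with "#--" and "##"
# stubs where docstrings and comments used to be.
do_file('example.py')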
Answer 2 (Score: 2)
This recipe here claims to do what you want, and a few other things as well.
Answer 3 (Score: 2)
Here is a modification of Dan's solution that makes it run on Python 3, also removes empty lines, and makes it ready to use:
import io, tokenize

def remove_comments_and_docstrings(source):
    io_obj = io.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        if token_type == tokenize.COMMENT:
            pass
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                if prev_toktype != tokenize.NEWLINE:
                    if start_col > 0:
                        out += token_string
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    out = '\n'.join(l for l in out.splitlines() if l.strip())
    return out

with open('test.py', 'r') as f:
    print(remove_comments_and_docstrings(f.read()))
Answer 4 (Score: 1)
Try testing each chunk of tokens ending with NEWLINE. Then I believe the correct pattern for a docstring (including the case where it serves as a comment but isn't assigned to __doc__) is (assuming the match is performed from the start of the file or right after a NEWLINE):

( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE

This should handle all the tricky cases: string concatenation, line continuation, module/class/function docstrings, and a comment on the same line after a string. Note that there is a difference between the NL and NEWLINE tokens, so we don't need to worry about single-line strings inside expressions.
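As a rough sketch of that idea (my own illustration, not the answerer's code; the grouping into NEWLINE-terminated chunks and the skipping of leading comment lines are my assumptions about the intent):

import io
import tokenize

def logical_chunks(source):
    # Yield runs of tokens that each end with a NEWLINE token.
    chunk = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        chunk.append(tok)
        if tok[0] == tokenize.NEWLINE:
            yield chunk
            chunk = []

def is_docstring_chunk(chunk):
    # Match ( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE.
    types = [tok[0] for tok in chunk]
    i = 0
    # Standalone comments and blank lines (COMMENT/NL tokens) never end
    # a logical line, so they ride along at the front of the next chunk.
    while types[i] in (tokenize.COMMENT, tokenize.NL):
        i += 1
    if types[i] == tokenize.INDENT:            # INDENT?
        i += 1
    else:                                      # DEDENT+
        while types[i] == tokenize.DEDENT:
            i += 1
    first_string = i
    while types[i] == tokenize.STRING:         # STRING+
        i += 1
    if i == first_string:                      # need at least one STRING
        return False
    if types[i] == tokenize.COMMENT:           # COMMENT?
        i += 1
    return i == len(types) - 1                 # only the NEWLINE may remain

A driver would then re-emit every chunk that fails the match, using the same column bookkeeping as the answers above.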
Answer 5 (Score: 0)
I was just using the code given by Dan McDougall, and I found two problems: there were too many empty new lines in the output, and keywords ended up glued to neighboring tokens, so the stripped code would no longer compile. I think I have fixed both things by adding (before the return) a few more lines:
# Removing unneeded newlines from string
buffered_content = cStringIO.StringIO(content) # Takes the string generated by Dan McDougall's code as input
content_without_newlines = ""
previous_token_type = tokenize.NEWLINE
for tokens in tokenize.generate_tokens(buffered_content.readline):
    token_type = tokens[0]
    token_string = tokens[1]
    if previous_token_type == tokenize.NL and token_type == tokenize.NL:
        pass
    else:
        # add necessary spaces
        prev_space = ''
        next_space = ''
        if token_string in ['and', 'as', 'or', 'in', 'is']:
            prev_space = ' '
        if token_string in ['and', 'del', 'from', 'not', 'while', 'as', 'elif', 'global', 'or', 'with', 'assert', 'if', 'yield', 'except', 'import', 'print', 'class', 'exec', 'in', 'raise', 'is', 'return', 'def', 'for', 'lambda']:
            next_space = ' '
        content_without_newlines += prev_space + token_string + next_space # This will be our new output!
    previous_token_type = token_type
Answer 6 (Score: 0)
I think the best way is to use ast.
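To flesh that out, a minimal sketch of the ast approach (my own illustration, assuming Python 3.9+ for ast.unparse; comments disappear automatically because the parser never sees them, so only docstrings need explicit removal):

import ast

def strip_docstrings(source):
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            body = node.body
            # A docstring is a bare string expression in first position
            if (body and isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)):
                if len(body) == 1:
                    body[0] = ast.Pass()  # keep the block syntactically valid
                else:
                    del body[0]
    return ast.unparse(tree)

The same trade-off the next answer mentions applies: blank lines and original formatting are not preserved.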
Answer 7 (Score: 0)
I found a simpler way to do this with the ast and astunparse modules (available from pip). It converts the code text into a syntax tree, and then the astunparse module prints the code back out without the comments. I had to strip out the docstrings with a simple matching, but it seems to work. I have been looking through the output, and so far the only downside of this method is that it strips all the newlines from your code.
import ast, astunparse

with open('my_module.py') as f:
    lines = astunparse.unparse(ast.parse(f.read())).split('\n')
    for line in lines:
        if line.lstrip()[:1] not in ("'", '"'):
            print(line)