Question

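I am new to Python and I am trying to strip comments and comment lines from a file of URLs (one URL per line). I am using a custom ArgumentParser (argparse) and overriding convert_arg_line_to_args in order to:

1. strip trailing comments at the end of a line, e.g. 'http://example.com # comment'
2. strip lines that are empty or are whole-line comments, e.g. '# This file contains URLs, one per line'

I can successfully strip trailing comments (1), but I cannot seem to remove empty lines or comment lines (2). The whole-line comments and empty lines still end up in my list of files.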

class CustomArgumentParser(argparse.ArgumentParser):
    def __init__(self, *args, **kwargs):
        super(CustomArgumentParser, self).__init__(*args, **kwargs)

    def convert_arg_line_to_args(self, line):
        '''Strip out comments from start points file'''
        if re.match('^#.*', line, 0) or re.match('^\s+$', line, 0):
            yield 
        arg = re.sub('\s+#.*$', '', line)
        yield arg

Is there a way to remove the empty lines and comment lines?

An example input file is:

# Start points for the spider 
#
http://www.website1.com/News.html?typeid=8                                      # All news
http://www.website1.com/News.html?typeid=5                                      # Business

http://www.website2.com/News.html?category=All%20Category%20News
http://www.website2.com/News.html?category=Category2

The original code returns the args from parse_args() as:

DEBUG:root:Args are: Namespace(URLs=['', '# Start points for the spider ', '', '#', 'http://www.website1.com/News.html?typeid=8', 'http://www.website1.com/News.html?typeid=5', 'http://www.website1.com/News.html?typeid=9', 'http://www.website1.com/News.html?typeid=10', 'http://www.website1.com/KeyInterviews.html', '', '', 'http://www.website2.com/News.html?category=All%20Category%20News', 'http://www.website2.com/News.html?category=Category2'], cacheDir='/tmp', debug_level=' 1', firstNPages=None, outputDir=None, storyType='news')

Changing the code to yield an empty list gives:

DEBUG:root:Args are: Namespace(URLs=[[], '# Start points for the spider ', [], '#', 'http://www.website1.com/News.html?typeid=8', 'http://www.website1.com/News.html?typeid=5', [], '', 'http://www.website2.com/News.html?category=All%20Category%20News', 'http://www.website2.com/News.html?category=Category2'], cacheDir='/tmp', debug_level=' 1', firstNPages=None, outputDir=None, storyType='news')

I would like args to look like:

DEBUG:root:Args are: Namespace(URLs=['http://www.website1.com/News.html?typeid=8', 'http://www.website1.com/News.html?typeid=5', 'http://www.website2.com/News.html?category=All%20Category%20News', 'http://www.website2.com/News.html?category=Category2'], cacheDir='/tmp', debug_level=' 1', firstNPages=None, outputDir=None, storyType='news')

Perhaps it is not possible to remove lines from the input file this way.

Answer 1

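Note that a bare yield statement yields the value None rather than yielding nothing, so a comment line or a whitespace-only line still contributes None to the argument list.

If you want the parser to skip a line, return an empty list. Rewrite the function to return [] for lines that should be skipped and [url] (where url is the cleaned line) for lines whose argument you want to keep.

BTW, your second regular expression does not match empty lines. It should read '^\s*$' so that it matches ZERO or more whitespace characters.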
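The regular expression point can be checked in isolation. This is a small sketch, not code from the question; it relies on the fact that argparse splits the file with splitlines(), so a blank line reaches convert_arg_line_to_args as the empty string ''.

```python
import re

blank = ''        # what a blank line in the file looks like after splitlines()
spaces = '   '    # a line containing only whitespace

# + requires at least one whitespace character, so '' is not matched
plus_blank = re.match(r'^\s+$', blank)
# * also accepts zero characters, so '' is matched
star_blank = re.match(r'^\s*$', blank)
star_spaces = re.match(r'^\s*$', spaces)

print(plus_blank is None)       # True
print(star_blank is not None)   # True
print(star_spaces is not None)  # True
```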

Answer 2

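Your implementation is actually a generator function rather than a regular function: when the yield keyword is used, every yield statement that executes provides a value. Even a bare yield produces the value None. Instead of providing either nothing or arg, you return an iterable that provides [None, arg] or [""] (an empty string).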

def convert_arg_line_to_args(self, line):
    '''Strip out comments from start points file'''
    if re.match('^#.*', line, 0) or re.match('^\s+$', line, 0):
        yield # yield None **and proceed**
    arg = re.sub('\s+#.*$', '', line)
    yield arg # yield arg
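To see this outside argparse, the same body can be run as a plain function. This is only an illustrative sketch: convert_line is a hypothetical stand-alone copy of the method (no self), with the patterns written as raw strings.

```python
import re

def convert_line(line):
    # Same logic as the method above, as a free function for experimenting.
    if re.match(r'^#.*', line) or re.match(r'^\s+$', line):
        yield  # bare yield: produces None, then execution continues
    arg = re.sub(r'\s+#.*$', '', line)
    yield arg

print(list(convert_line('# Start points')))                # [None, '# Start points']
print(list(convert_line('')))                              # ['']
print(list(convert_line('http://example.com # comment')))  # ['http://example.com']
```

The comment line is not removed because the substitution only strips a '#' that is preceded by whitespace, and the empty line slips past both checks.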

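To start with, you do not need a generator here: use return instead of yield. Note that argparse expects an iterable of values; a valid iterable with no values is, for example, the empty list [].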

def convert_arg_line_to_args(self, line):
    '''Strip out comments from start points file'''
    # \s* (not \s+) so that a completely empty line is matched as well
    if re.match(r'^#.*', line) or re.match(r'^\s*$', line):
        return []  # return NO values, **and stop**
    arg = re.sub(r'\s+#.*$', '', line)
    return [arg]  # return ONLY arg

This is the minimal modification needed to make the code work.
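For completeness, here is a self-contained sketch that runs a return-based version end to end. Several details are assumptions rather than facts from the question: the '@' prefix passed to fromfile_prefix_chars, the nargs='*' positional named URLs (the name is taken from the Namespace output above), the single combined pattern for blank and comment lines, and the temporary file that stands in for the real start points file.

```python
import argparse
import os
import re
import tempfile

class CustomArgumentParser(argparse.ArgumentParser):
    def convert_arg_line_to_args(self, line):
        """Return [] for blank or comment lines, else [line without trailing comment]."""
        if re.match(r'^\s*(#.*)?$', line):
            return []
        return [re.sub(r'\s+#.*$', '', line)]

sample = """# Start points for the spider
#
http://www.website1.com/News.html?typeid=8      # All news

http://www.website2.com/News.html?category=Category2
"""

# Write the sample to a temporary file that plays the role of the URL file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as handle:
    handle.write(sample)

parser = CustomArgumentParser(fromfile_prefix_chars='@')  # '@' is an assumed prefix
parser.add_argument('URLs', nargs='*')
args = parser.parse_args(['@' + handle.name])
os.unlink(handle.name)

print(args.URLs)
```

With this input, args.URLs contains only the two URLs; the comment lines and the blank line contribute nothing.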

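Now, while regular expressions work for this use case, they are generally overkill. Python's str class has efficient built-in methods for manipulating and inspecting strings: you can cut off the comment, strip the whitespace, and see whether anything is left.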

def convert_arg_line_to_args(self, line):
    '''Strip out comments from start points file'''
    line, *_ = line.split('#', maxsplit=1)  # the `*_` consumes any optional comment content
    arg = line.strip()  # remove whitespace - we have just the bare argument now
    if arg:  # is there anything left as an argument?
        return [arg] # return ONLY arg, and stop
    return []
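One caveat with splitting on every '#': a URL can itself contain a '#' fragment, and split('#') would cut it off, whereas the regular expression in the question only removes a '#' that follows whitespace. The sketch below keeps that behaviour using str methods only. It assumes that comments always start at the beginning of a line or after whitespace, and strip_comment is a hypothetical helper name, not part of argparse.

```python
def strip_comment(line):
    """Return [arg] or []; '#' starts a comment only at line start or after whitespace."""
    stripped = line.strip()
    if not stripped or stripped.startswith('#'):
        return []  # blank line or whole-line comment
    words = []
    for word in stripped.split():  # split() with no argument splits on whitespace runs
        if word.startswith('#'):
            break  # the rest of the line is a trailing comment
        words.append(word)
    return [' '.join(words)]

print(strip_comment('http://example.com/page#top   # a comment'))  # ['http://example.com/page#top']
print(strip_comment('# whole-line comment'))                       # []
print(strip_comment(''))                                           # []
```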

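If you want to explore generators versus functions, the generator version is actually slightly more elegant. We added [] lists everywhere because argparse needs an iterable, but a generator already is an iterable.

What does this mean in practice? If there is an argument, just yield it; it is "contained" in the generator itself. If there is no argument, never yield; the generator then finishes without providing anything.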

def convert_arg_line_to_args(self, line):
    '''Strip out comments from start points file'''
    line, *_ = line.split('#', maxsplit=1)
    arg = line.strip()
    if arg:
        yield arg  # provide arg; nothing follows, so the generator then stops
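A quick way to convince yourself is to run the generator body on a few sample lines. This is only a sketch: convert_line is a hypothetical stand-alone copy of the method without self.

```python
def convert_line(line):
    line, *_ = line.split('#', maxsplit=1)  # drop everything after the first '#'
    arg = line.strip()
    if arg:
        yield arg  # only non-empty arguments are ever yielded

results = {
    sample: list(convert_line(sample))
    for sample in ['# comment only', '', 'http://example.com  # note']
}
for sample, values in results.items():
    print(repr(sample), '->', values)
```

The first two samples give an empty list and the third gives ['http://example.com'], so argparse receives nothing at all for comment lines and blank lines.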
