Question

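I'm using Python to chain together several processing tools for NLP tasks, and I also capture the output of each one so that, if something fails, I can write it to a log.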

Some tools take a very long time and report their current status as a progress percentage followed by a carriage return (\r). They run many steps, so normal messages and progress messages are mixed. This sometimes produces very large log files that are hard to view with less. My log looks like this (for a fast-running step):

[DEBUG ] [FILE] [OUT] ^M4% done^M8% done^M12% done^M15% done^M19% done^M23% done^M27% done^M31% done^M35% done^M38% done^M42% done^M46% done^M50% done^M54% done^M58% done^M62% done^M65% done^M69% done^M73% done^M77% done^M81% done^M85% done^M88% done^M92% done^M96% done^M100% doneFinished ...

What I want is an easy way to collapse these strings in Python. (I suppose it would also be possible to do this after the pipeline has finished and replace the progress messages with, e.g., sed.)

My code for running and capturing the output looks like this:

import os
import subprocess
from tempfile import NamedTemporaryFile

# debug() and error() are logging helpers defined elsewhere (not shown here).

def run_command_of(command):
    try:
        out_file = NamedTemporaryFile(mode='w+b', delete=False, suffix='out')
        err_file = NamedTemporaryFile(mode='w+b', delete=False, suffix='err')
        debug('Redirecting command output to temp files ...', \
            'out =', out_file.name, ', err =', err_file.name)

        p = subprocess.Popen(command, shell=True, \
            stdout=out_file, stderr=err_file)
        p.communicate()
        status = p.returncode

        def fr_gen(file):
            debug('Reading from %s ...' % file.name)
            file.seek(0)
            for line in file:
                # TODO: UnicodeDecodeError?
                # reload(sys)
                # sys.setdefaultencoding('utf-8')
                # unicode(line, 'utf-8')
                # no decoding ...
                yield line.decode('utf-8', errors='replace').rstrip()
            debug('Closing temp file %s' % file.name)
            file.close()
            os.unlink(file.name)

        return (fr_gen(out_file), fr_gen(err_file), status)
    except:
        from sys import exc_info
        error('Error while running command', command, exc_info()[0], exc_info()[1])
        return (None, None, 1)

def execute(command, check_retcode_null=False):
    debug('run command:', command)
    out, err, status = run_command_of(command)
    debug('-> exit status:', status)
    if out is not None:
        is_empty = True
        for line in out:
            is_empty = False
            debug('[FILE]', '[OUT]', line.encode('utf-8', errors='replace'))
        if is_empty:
            debug('execute: no output')
    else:
        debug('execute: no output?')
    if err is not None:
        is_empty = True
        for line in err:
            is_empty = False
            debug('[FILE]', '[ERR]', line.encode('utf-8', errors='replace'))
        if is_empty:
            debug('execute: no error-output')
    else:
        debug('execute: no error-output?')
    if check_retcode_null:
        return status == 0
    return True

This is some older Python 2 code (fun times with unicode strings) that I want to rewrite in Python 3 and improve. (I'm also open to suggestions on how to process the output in real time instead of after everything has finished. Update: that is too broad and not exactly part of my question.)

I can think of many approaches, but I don't know whether there is a ready-made function/library/etc., and I couldn't find one. (My google-fu needs work.) The only things I found were ways to remove the CR/LF, not the part of the string that gets visually replaced. So I'm open to suggestions and improvements before I spend time reinventing the wheel. ;-)

My approach would be to use a regex to find the sections of a string/line that lie between \r characters and remove them. Optionally, I would keep a single percentage value for very long processes. Something like \r([^\r]*\r).
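The regex idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name collapse_progress and the exact pattern are hypothetical (not from the question), the line is assumed to contain no control characters other than \r, and the text before the first \r (the log prefix) is deliberately kept, which differs from what a terminal would display.

```python
import re

# Matches a carriage return plus the text after it, but only when another
# carriage return follows, i.e. when that text is about to be overwritten.
# (Hypothetical pattern, a variation on the one proposed in the question.)
_OVERWRITTEN = re.compile(r'\r[^\r\n]*(?=\r)')


def collapse_progress(line):
    """Drop every progress message that a later one replaces."""
    line = line.rstrip('\r\n')          # ignore the line ending itself
    line = _OVERWRITTEN.sub('', line)   # remove the overwritten sections
    return line.replace('\r', '')       # drop the one remaining '\r'


sample = '[DEBUG ] [FILE] [OUT] \r4% done\r8% done\r100% doneFinished ...'
print(collapse_progress(sample))
```

Only the last progress message of each line survives, so the log prefix and the final status remain readable in less.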

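Note: this is a possible duplicate of "How to pull the output of the most recent terminal command?", which may require a wrapper script. It can still be used to convert my old log files. I found / was given a suggestion for a plain Python way that satisfies my needs.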

Answer 1

I think the solution for my use case is as simple as the following snippet:

# my data
segments = ['abcdef', '567', '1234', 'xy', '\n']
s = '\r'.join(segments)

# really fast approach:
last = s.rstrip().split('\r')[-1]

# or: simulate the overwrites
parts = s.rstrip().split('\r')
last = parts[-1]
last_len = len(last)
for part in reversed(parts):
    if len(part) > last_len:
        # append the whole tail of the longer part, not just one character
        last = last + part[last_len:]
        last_len = len(last)

# result
print(last)

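Thanks to the comments on my question, I could refine my requirements further. In my case the only control characters are carriage returns (CR, \r), and a rather simple solution works, as tripleee suggested.

Why not simply take the last part after \r?

The output of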

echo -e "abcd\r12"

can result in:

12cd
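The overwrite behaviour shown by the echo example can also be reproduced by walking the segments front to back. The following is a minimal sketch, assuming \r is the only control character and every character takes one column; the function name render_carriage_returns is made up for illustration.

```python
def render_carriage_returns(line):
    """Return the text a terminal would show after processing each '\\r'."""
    shown = ''
    for segment in line.rstrip('\r\n').split('\r'):
        # A segment written after '\r' replaces the start of what is shown;
        # anything beyond its length stays visible.
        shown = segment + shown[len(segment):]
    return shown


print(render_carriage_returns('abcd\r12'))  # 12cd
```

Taking only the last part after \r would give 12 here, which is why longer earlier segments have to be taken into account when an exact reproduction of the terminal output matters.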

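The questions under the subprocess tag (also suggested in a comment by tripleee) should help with real-time/interleaved output, but that is outside my current focus. I will have to test which approach works best. I was already using stdbuf to switch the buffering when needed.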
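For the Python 3 rewrite mentioned in the question, capturing after the process has finished (not in real time) might look roughly like this. It is a sketch under assumptions: run_and_collapse and last_segment are hypothetical names, the command is passed as an argument list rather than a shell string, and only the "last part after \r" rule is applied.

```python
import subprocess
import sys


def last_segment(line):
    """Keep only the text written after the final carriage return."""
    return line.rstrip('\r').split('\r')[-1]


def run_and_collapse(command):
    """Run command (a list of arguments); return (status, out_lines, err_lines)."""
    # Capture bytes: text mode would translate a bare '\r' into '\n'.
    proc = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    def lines(raw):
        text = raw.decode('utf-8', errors='replace')
        # Split on '\n' only; str.splitlines() would also split on '\r'.
        return [last_segment(l) for l in text.split('\n') if l.strip()]

    return proc.returncode, lines(proc.stdout), lines(proc.stderr)


# Stand-in child process that prints progress the way the tools do.
child = [sys.executable, '-c',
         "import sys; sys.stdout.write('start\\n\\r4% done\\r50% done\\r100% done\\nFinished\\n')"]
status, out, err = run_and_collapse(child)
print(status, out, err)
```

Splitting the decoded output on '\n' alone is the important design choice here, since both text-mode pipes and str.splitlines() treat a bare \r as a line break and would turn every progress update into its own log line.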

How to best capture final process progress output containing carriage returns
