Question

I want to clean up some keylogged input using Python and regex, in particular where the backspace key was used to fix a mistake.

Example 1:

[in]:  'Helloo<BckSp> world'
[out]: 'Hello world'

This can be done with:

re.sub(r'.<BckSp>', '', 'Helloo<BckSp> world')

Example 2:
However, when there are several backspaces in a row, I don't know how to delete exactly the same number of characters:

[in]:  'Helllo<BckSp><BckSp>o world'
[out]: 'Hello world'

(Here I want to remove the 'l' and the 'o' that precede the two backspaces.)

I could simply apply re.sub(r'[^>]<BckSp>', '', line) repeatedly until no <BckSp> is left, but I'd like to find a more elegant/faster solution.

Does anyone know how to do this?

Answer 1

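It looks like Python's built-in re module doesn't support recursive regular expressions. If you can use another language, you could try this: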

.(?R)?<BckSp>

See: https://regex101.com/r/OirPNn/1

Answer 2

It's not very efficient, but you can do this with the re module:

(?:[^<](?=[^<]*((?=(\1?))\2<BckSp>)))+\1

demo

Note: CPython's re parser may reject this pattern with "cannot refer to an open group", because \1 is referenced inside group 1 itself. In that case the pattern only works in engines that allow such references, such as PCRE.

This way you don't have to count anything; the pattern relies on repetition alone.

(?: 
    [^<] # a character to remove
    (?=  # lookahead to reach the corresponding <BckSp>
        [^<]* # skip characters until the first <BckSp>
        (  # capture group 1: contains the <BckSp>s
            (?=(\1?))\2 # emulate an atomic group in place of \1?+
                        # The idea is to add the <BckSp>s already matched in the
                        # previous repetitions if any to be sure that the following
                        # <BckSp> isn't already associated with a character
            <BckSp> # corresponding <BckSp>
        )
    )
)+ # each time the group is repeated, the capture group 1 is growing with a new <BckSp>

\1 # matches all the consecutive <BckSp>s and ensures that there are no more characters
   # between the last character to remove and the first <BckSp>

You can do the same with the regex module, but this time you don't need to emulate the possessive quantifier:

(?:[^<](?=[^<]*(\1?+<BckSp>)))+\1

demo

But with the regex module you can also use recursion (as @Fallenhero noticed):

[^<](?R)?<BckSp>

demo
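
If only the standard library is available, the way the recursive pattern pairs each character with its own <BckSp> can be mimicked by hand. The following is just a sketch of that idea, not an equivalent of the regex engine: match_nested and remove_nested are hypothetical names, the marker is assumed to be the literal <BckSp> from the question, and the recursion depth grows with the length of the run before a marker, so very long inputs could hit Python's recursion limit.

```python
MARK = "<BckSp>"  # marker text taken from the question


def match_nested(s, i):
    """Mimic [^<](?R)?<BckSp> at index i; return the end index, or -1."""
    if i >= len(s) or s[i] == "<":
        return -1                      # [^<] must consume one non-'<' character
    j = i + 1
    inner = match_nested(s, j)         # the optional (?R), tried first
    if inner != -1 and s.startswith(MARK, inner):
        return inner + len(MARK)       # char + nested match + <BckSp>
    if s.startswith(MARK, j):
        return j + len(MARK)           # (?R)? skipped: char + <BckSp>
    return -1


def remove_nested(s):
    """Drop every balanced 'n characters + n markers' match, left to right."""
    out, i = [], 0
    while i < len(s):
        end = match_nested(s, i)
        if end == -1:
            out.append(s[i])           # no match here, keep the character
            i += 1
        else:
            i = end                    # skip the whole balanced match
    return "".join(out)


print(remove_nested("Helllo<BckSp><BckSp>o world"))  # Hello world
```

As with the regex, a <BckSp> that has no character to delete (for example at the very start of the string) is left in the output.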

Answer 3

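Since recursion/subroutine calls are not supported, and Python re has no atomic groups/possessive quantifiers (they were only added in Python 3.11), you can remove each character that is followed by a backspace in a loop: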

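import re
# "\b" in a normal (non-raw) string is the backspace character itself
s = "Helllo\b\bo world"
# leading backspaces, or any non-backspace character followed by one backspace
r = re.compile("^\b+|[^\b]\b")
while r.search(s):  # repeat until no removable pair is left
    s = r.sub("", s)
print(s)  # Hello world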

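See the Python demo

The "^\b+|[^\b]\b" pattern finds one or more backspace characters at the start of the string (with ^\b+), while [^\b]\b finds all non-overlapping occurrences of any character other than a backspace that is followed by a backspace.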

If the backspace is represented as some entity/tag (like the literal <BckSp>), the same approach applies:

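import re
s = "Helllo<BckSp><BckSp>o world"
# re.S lets "." match a newline too, so a backspace can also delete a line break
r = re.compile("^(?:<BckSp>)+|.<BckSp>", flags=re.S)
while r.search(s):  # repeat until no removable pair is left
    s = r.sub("", s)
print(s)  # Hello world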

See another Python demo
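
Both loops above can be folded into one helper. This is only a sketch under a few assumptions: strip_backspaces is a hypothetical name, the marker is passed in as a plain string and escaped with re.escape, and re.subn is used so that each pass reports how many replacements it made instead of running a separate search first.

```python
import re


def strip_backspaces(text, marker="<BckSp>"):
    """Repeatedly delete 'one character + one marker' until nothing changes."""
    m = re.escape(marker)  # treat the marker as literal text
    # Either a run of markers at the very start (nothing to delete there),
    # or one character that does not itself start a marker, then one marker.
    pattern = re.compile(rf"^(?:{m})+|(?!{m}).{m}", flags=re.S)
    while True:
        text, count = pattern.subn("", text)
        if count == 0:  # a pass with no replacement means we are done
            return text


print(strip_backspaces("Helllo<BckSp><BckSp>o world"))   # Hello world
print(strip_backspaces("Helllo\b\bo world", marker="\b"))  # Hello world
```

The number of passes equals the longest run of consecutive markers plus one final pass that finds nothing, so this stays cheap for typical keylogger output but is not a single-pass solution.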

Answer 4

If the marker is a single character, you can use a stack, which gives you the result in a single pass:

s = "Helllo\b\bo world"
res = []

for c in s:
    if c == '\b':
        if res:
            del res[-1]
    else:
        res.append(c)

print(''.join(res)) # Hello world

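If the marker is literally '<BckSp>' or some other string longer than one character, you can use replace to substitute it with '\b' and then use the solution above. This only works if you know that '\b' does not occur in the input. If you can't designate a substitute character, you can use split and process the result: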

s = 'Helllo<BckSp><BckSp>o world'
res = []

for part in s.split('<BckSp>'):
    if res:
        del res[-1]
    res.extend(part)

print(''.join(res)) # Hello world
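
The replace-based variant described above could look roughly like this. It is a sketch, not part of the original answer: apply_backspaces and the sentinel parameter are hypothetical names, and the guard simply refuses input that already contains the substitute character rather than trying to pick another one.

```python
def apply_backspaces(text, marker="<BckSp>", sentinel="\b"):
    """Replace the multi-character marker by a sentinel, then run the stack."""
    if marker != sentinel and sentinel in text:
        # The substitution would be ambiguous, so bail out (assumed policy).
        raise ValueError("sentinel character already occurs in the input")
    res = []
    for c in text.replace(marker, sentinel):
        if c == sentinel:
            if res:          # a backspace on empty output deletes nothing
                res.pop()
        else:
            res.append(c)
    return "".join(res)


print(apply_backspaces("Helllo<BckSp><BckSp>o world"))  # Hello world
```

When the guard would trip on real data, the split-based version above avoids the problem because it never needs a substitute character.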

Answer 5

A bit verbose, but you can use this lambda function to count the <BckSp> occurrences and use string slicing to get the final output.

>>> import re
>>> bk = '<BckSp>'

>>> s = 'Helllo<BckSp><BckSp>o world'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello world

>>> s = 'Helloo<BckSp> world'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello world

>>> s = 'Helloo<BckSp> worl<BckSp>d'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello word

>>> s = 'Helllo<BckSp><BckSp>o world<BckSp><BckSp>k'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello work
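
The same counting idea can be written with a named callback, which also makes it easy to clamp the slice when a run contains more backspaces than there are characters in front of it. This is a sketch with hypothetical names (BK, RUN, _trim, clean); it assumes the literal <BckSp> marker from the question and uses re.S so that newlines are treated like any other character.

```python
import re

BK = "<BckSp>"
# group 1: text before a run of markers, group 2: the run itself
RUN = re.compile(r"(.*?)((?:" + re.escape(BK) + r")+)", flags=re.S)


def _trim(match):
    text, run = match.group(1), match.group(2)
    n = len(run) // len(BK)            # number of backspaces in this run
    keep = max(0, len(text) - n)       # clamp: never slice with a negative end
    return text[:keep]


def clean(s):
    return RUN.sub(_trim, s)


print(clean("Helllo<BckSp><BckSp>o world<BckSp><BckSp>k"))  # Hello work
print(clean("ab<BckSp><BckSp><BckSp>c"))                    # c
```

One limitation remains: each run only trims the text captured directly before it, so surplus backspaces cannot reach characters kept by an earlier match. For example, clean("ab<BckSp>c<BckSp><BckSp>") returns 'a', whereas the stack-based approach returns an empty string.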

Match the same number of repeated characters as the repetitions of a capture group
