Question

I have a use case where I want to replace multiple spaces with a single space, unless they appear inside quotes. For example

Original

this is the first    a   b   c
this is the second    "a      b      c"

Becomes

this is the first a b c
this is the second "a      b      c"

I believe a regular expression should be able to do this, but I don't have much experience with them. Here is some code I already have

import re

str = 'this is the second    "a      b      c"'
# Replace all multiple spaces with single space
print(re.sub(r'\s\s+', ' ', str))

# Doesn't work, but something like this
print(re.sub(r'["]^.*\s\s+.*["]^', ' ', str))

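I understand why my second attempt doesn't work, so I'm just looking for alternative approaches. If possible, could you explain the individual parts of your regex solution? Thanks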

Answer 1

Assuming there is no " inside the "substring":

import re
str = 'a    b    c  "d   e   f"'  
str = re.sub(r'("[^"]*")|[ \t]+', lambda m: m.group(1) if m.group(1) else ' ', str)

print(str)
#'a b c "d   e   f"'

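The regex ("[^"]*")|[ \t]+ matches either a quoted substring or a run of one or more spaces or tabs. Because the quoted-substring alternative comes first, a quoted substring is consumed as a whole, so the whitespace inside it is never matched by the alternative subpattern [ \t]+ and is left untouched.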

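The pattern that matches the quoted substring is wrapped in () so that the callback can check whether it matched. If it did, m.group(1) is truthy and its value is simply returned. If not, the match was whitespace, so a single space is returned as the replacement value.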

Without the lambda:

def repl(match):
    quoted = match.group(1)
    return quoted if quoted else ' '

str = re.sub(r'("[^"]*")|[ \t]+', repl, str)
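Since the question asks what each part of the regex does, the same pattern can be spelled out with re.VERBOSE and one comment per piece. This is only a sketch: the names PATTERN, collapse_spaces and show_matches are made up for illustration and do not come from the answer.

```python
import re

# The answer's pattern, written in verbose mode. Whitespace in the pattern
# is ignored there, except inside a character class such as [ \t].
PATTERN = re.compile(r'''
    ( " [^"]* " )   # group 1: opening quote, any non-quote characters, closing quote
    |               # or
    [ \t]+          # a run of one or more spaces or tabs
''', re.VERBOSE)


def collapse_spaces(text):
    # Return quoted substrings unchanged; replace any other run with one space.
    return PATTERN.sub(lambda m: m.group(1) or ' ', text)


def show_matches(text):
    # List (matched_text, was_quoted) pairs to see which alternative matched.
    return [(m.group(0), m.group(1) is not None) for m in PATTERN.finditer(text)]


print(collapse_spaces('a    b    c  "d   e   f"'))
print(show_matches('a  "b  c"'))
```

show_matches makes the order of evaluation visible: the run of spaces outside the quotes is reported with was_quoted False, while the quoted substring is reported as one match with was_quoted True.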

Answer 2

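If you want a solution that works reliably every time, regardless of the input and without caveats such as disallowing embedded quotes, you should write a simple parser rather than use a regular expression or split on quotes.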

def parse(s):
    last = ''
    result = ''
    toggle = 0
    for c in s:
        if c == '"' and last != '\\':
            toggle ^= 1
        if c == ' ' and toggle == 0 and last == ' ':
            continue
        result += c
        last = c
    return result

test = r'"  <  >"test   1   2   3 "a \"<   >\"  b  c"'
print(test)
print(parse(test))
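To see where the two approaches differ, here is a small side-by-side sketch run on the same test string. The names collapse_regex, collapse_parser and sample are illustrative only, and collapse_parser is a list-based variation on parse(), not the answer's exact code.

```python
import re


def collapse_regex(text):
    # Regex approach from Answer 1; it has no notion of backslash-escaped quotes.
    return re.sub(r'("[^"]*")|[ \t]+', lambda m: m.group(1) or ' ', text)


def collapse_parser(text):
    # Character-by-character approach in the spirit of parse(),
    # collecting pieces in a list instead of concatenating strings.
    out = []
    in_quotes = False
    prev = ''
    for ch in text:
        if ch == '"' and prev != '\\':
            in_quotes = not in_quotes  # an unescaped quote opens or closes a quoted run
        if ch == ' ' and prev == ' ' and not in_quotes:
            continue  # drop a repeated space outside quotes
        out.append(ch)
        prev = ch
    return ''.join(out)


sample = r'"  <  >"test   1   2   3 "a \"<   >\"  b  c"'
print(collapse_regex(sample))
print(collapse_parser(sample))
```

On this input the regex version treats the escaped \" as a closing quote, so the spaces between < and > in the second quoted part are collapsed, while the parser version leaves them alone. Note that the parser version only looks at spaces, whereas the regex version also collapses tabs.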

Replace multiple spaces with a single space, unless they appear between quotes?
