Question

Okay, I'm currently using Python's regular expression library to split the following string into groups of semicolon-delimited fields.

'key1:"this is a test phrase"; key2:"this is another test phrase"; key3:"ok this is a gotcha\; but you should get it";'

Regex: \s*([^;]+[^\\])\s*;

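I've been using the PCRE above, and it worked fine until I ran into a case where an escaped semicolon appears inside one of the phrases, as in key3 above.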

How can I modify this expression so that it splits only on non-escaped semicolons?

Answer 1

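The basic version of this is where you want to ignore any ; that's preceded by a backslash, regardless of anything else. That's relatively simple: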

\s*([^;]*[^;\\]);
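A quick check of the basic rule, as a minimal sketch (the variable names are mine, not from the question). Because `[^;]*` can never consume a semicolon, `findall` with the pattern above appears to skip the text before an escaped `\;` instead of keeping that field whole. One alternative way to express "split on any ; not preceded by a backslash" is a negative lookbehind with `re.split`; this is an assumption on my part, not part of the original answer.

```python
import re

s = r'key1:"this is a test phrase"; key2:"this is another test phrase"; key3:"ok this is a gotcha\; but you should get it";'

# Basic pattern from above: a captured field cannot contain any ';',
# so the part of key3 before the escaped ';' is skipped.
basic = re.findall(r'\s*([^;]*[^;\\]);', s)

# Alternative (assumption): split on every ';' that is not
# immediately preceded by a backslash, then drop empty pieces.
fields = [f.strip() for f in re.split(r'(?<!\\);', s) if f.strip()]

for f in fields:
    print(f)
```

Like the basic rule itself, the lookbehind treats `\\;` as escaped, so it does not handle escaped backslashes.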

What makes this tricky is if you want escaped backslashes in the input to be treated as literals. For example:

"You may want to split here\\;"
"But not here\;"

If you want to account for that, try this (edited):

\s*((?:[^;\\]|\\.)+);

Why so complicated? Because if escaped backslashes are allowed, you have to account for things like this:

"0 slashes; 2 slashes\\; 5 slashes\\\\\; 6 slashes\\\\\\;"

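Each pair of doubled backslashes is treated as a literal \. That means a ; is escaped only if it is preceded by an odd number of backslashes. So the input above would be grouped as follows (shown with the escapes already resolved):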

#1: '0 slashes'
#2: '2 slashes\'
#3: '5 slashes\\; 6 slashes\\\'

Hence the different parts of the pattern:

\s*            #Whitespace
((?:
    [^;\\]     #One character that's not ; or \
  |            #Or...
    \\.        #A backslash followed by any character, even ; or another backslash
)+);           #Repeated one or more times, followed by ;

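Requiring a character after each backslash ensures that backslashes are always consumed in pairs with the character they escape, so the second character is treated as escaped even if it is another backslash.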
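To see the pattern in action, here is a small sketch run against the backslash example above. The names `FIELD`, `raw_fields`, and `decoded` are illustrative, and the unescaping step is an assumed post-processing pass that the answer does not spell out.

```python
import re

# The pattern from the answer: ordinary characters, or a backslash
# paired with whatever character follows it.
FIELD = re.compile(r'\s*((?:[^;\\]|\\.)+);')

s = r'0 slashes; 2 slashes\\; 5 slashes\\\\\; 6 slashes\\\\\\;'

# Captured groups still contain the escaping backslashes.
raw_fields = FIELD.findall(s)

# Assumed post-processing: replace each "backslash + char" pair
# with just the character to resolve the escapes.
decoded = [re.sub(r'\\(.)', r'\1', f) for f in raw_fields]

for raw, dec in zip(raw_fields, decoded):
    print(repr(raw), '->', repr(dec))
```

The captured groups keep the raw backslashes; only after the unescaping pass do they match the three groups listed earlier.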

Answer 2

If the strings may contain semicolons and escaped quotes (or escaped anything), I'd suggest parsing each valid key:"value"; sequence instead. Like so:

import re
s = r'''
    key1:"this is a test phrase";
    key2:"this is another test phrase";
    key3:"ok this is a gotcha\; but you should get it";
    key4:"String with \" escaped quote";
    key5:"String with ; unescaped semi-colon";
    key6:"String with \\; escaped-escape before semi-colon";
    '''
result = re.findall(r'\w+:"[^"\\]*(?:\\.[^"\\]*)*";', s)
print(result)

Note that this correctly handles any escapes within the double-quoted strings.
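If the keys and values are needed separately, the same pattern can be given capture groups. The following is only a sketch under that assumption: `PAIR` and `pairs` are hypothetical names, and the unescaping of values is an extra step the answer does not include.

```python
import re

s = r'''
    key1:"this is a test phrase";
    key4:"String with \" escaped quote";
    key6:"String with \\; escaped-escape before semi-colon";
    '''

# Same pattern as above, with groups around the key and the quoted value.
PAIR = re.compile(r'(\w+):"([^"\\]*(?:\\.[^"\\]*)*)";')

# Build a dict, resolving "backslash + char" pairs in each value.
pairs = {key: re.sub(r'\\(.)', r'\1', value) for key, value in PAIR.findall(s)}

print(pairs)
```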

Python regex: ignoring escaped characters
