Question

If I have a string:

The quick brown fox jumps over the lazy dog.

I can do:

'The quick brown fox jumps over the lazy dog.'.split(' ')
    => ['The','quick','brown','fox','jumps','over','the','lazy','dog.']

But now, suppose I have a string like this:

'The [quick brown fox] jumps over the [lazy dog.]'

I want this result:

['The','[quick brown fox]','jumps','over','the','[lazy dog.]']

Splitting on the ' ' character would obviously produce:

['The','[quick','brown','fox]','jumps','over','the','[lazy','dog.]']

Another example, which we might often see in CSV parsing:

'The,[quick,brown,fox],jumps,over,the,[lazy,dog.]'.somehow_split_with_delimiters()
    => ['The','[quick brown fox]','jumps','over','the','[lazy dog.]']

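To summarize: I want to split a string, but I also want to allow a pair of "escape" delimiters, so that any split delimiter found between the escape delimiters does not split the string.

The only solutions I have right now are to parse the string character by character and build the list: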

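myString = 'The,[quick,brown,fox],jumps,over,the,[lazy,dog.]'
delimiter = ','
final_parts = []
temp_string = ''
in_escape = False
for ch in myString:
    if ch == '[':
        in_escape = True
    if ch == delimiter and not in_escape:
        # delimiter outside brackets: finish the current part
        final_parts.append(temp_string)
        temp_string = ''
    else:
        temp_string += ch
    if ch == ']':
        in_escape = False
final_parts.append(temp_string)  # the last part has no trailing delimiter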

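Or to split the string first and then iterate over the pieces, looking for the escape delimiters in order to recombine the results: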

initial_parts = 'The [quick brown fox] jumps over the [lazy dog.]'.split(' ')
final_parts = []
temp_part = ''
in_escape = False
for part in initial_parts:
    if part.startswith('['):
        in_escape = True
    if in_escape:
        temp_part += part + ' '
    else:
        final_parts.append(part)
    if in_escape and part.endswith(']'):
        in_escape = False
        final_parts.append(temp_part.strip(' '))
        temp_part = ''  # reset for the next bracketed group

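Both approaches seem really clunky and error-prone (and since I wrote them quickly, they probably still have bugs). They also don't account for escaping the escape delimiters themselves (e.g. they won't treat \[ or \] as meaning that the character does not mark the start or end of an escaped region).

It feels like there should be a simpler way to split a string while allowing for escape characters. Shells do this all the time, for example: cp my file.txt my new file.txt produces extraneous arguments, but cp "my file.txt" "my new file.txt" escapes the spaces.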
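For reference, the shell behaviour described above can be reproduced in Python with shlex.split from the standard library. This is only a minimal sketch of the quoting case: it handles quote characters, not the square-bracket delimiters used in the earlier examples.

```python
import shlex

# Unquoted spaces split the command into separate arguments.
unquoted = shlex.split('cp my file.txt my new file.txt')

# Double quotes keep each file name together as a single argument.
quoted = shlex.split('cp "my file.txt" "my new file.txt"')

print(unquoted)
print(quoted)
```

The quotes themselves are removed from the result, which differs from the bracket examples, where the brackets are expected to stay in the output.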

Answer 1

My approach is to use a regular expression. There are two cases to handle: a single word (\w+) or a group of words in square brackets (\[[^\]]+\]).

s = 'The [quick brown fox] jumps over the [lazy dog.]'

import re

pattern = re.compile(r'(\w+)|(\[[^\]]+\])')

pattern.findall(s)
Out[32]: 
[('The', ''),
 ('', '[quick brown fox]'),
 ('jumps', ''),
 ('over', ''),
 ('the', ''),
 ('', '[lazy dog.]')]

[a or b for a, b in pattern.findall(s)]
Out[33]: ['The', '[quick brown fox]', 'jumps', 'over', 'the', '[lazy dog.]']

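Note that the Out[32] result is a list of tuples, each holding a match for either the first pattern or the second. One way to get from this list of tuples to a list of strings is shown on the next line, using the or trick: the expression a or b returns whichever of the two strings is non-empty.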
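As a variation on the above, dropping the capturing groups makes findall return the whole match as a plain string, so the or step is no longer needed. This is a sketch rather than part of the original answer, and the name flat_pattern is made up for illustration. Note that \w+ does not match punctuation, so it works here only because the period sits inside the brackets.

```python
import re

s = 'The [quick brown fox] jumps over the [lazy dog.]'

# Without capturing groups, findall returns each full match as a string.
flat_pattern = re.compile(r'\w+|\[[^\]]+\]')

parts = flat_pattern.findall(s)
print(parts)
```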

Answer 2

I would first do a regex split on [ and ], then process the sub-parts. Something along these lines:

>>> import re
>>> s = 'The [quick brown fox] jumps over the [lazy dog.]'
>>> def bracket_split(delim, string):
...   initial = re.split(r'[\[\]]', string)
...   result = []
...   for s in initial:
...     if not s: continue # throw away blank strings
...     if s.startswith(delim) or s.endswith(delim):
...       result.extend(s.strip(delim).split(delim))
...     else:
...       result.append(s.join('[]'))
...   return result
... 
>>> 
>>> bracket_split(' ', s)
['The', '[quick brown fox]', 'jumps', 'over', 'the', '[lazy dog.]']

But I'll be the first to admit that it is fragile. '[ this would break ]' would break it, because the delimiter sits just inside the brackets.
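One way to reduce that fragility, sketched here under the assumption that brackets are never nested or unbalanced, is to split with a capturing group so that the bracketed pieces come back intact and only the text between them is split on the delimiter. The name bracket_split_keep is hypothetical.

```python
import re

def bracket_split_keep(delim, string):
    # A capturing group makes re.split keep the bracketed pieces;
    # they land at the odd indices of the resulting list.
    pieces = re.split(r'(\[[^\]]*\])', string)
    result = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            result.append(piece)  # bracketed piece, kept whole
        else:
            # plain text: split on the delimiter, dropping empty strings
            result.extend(p for p in piece.split(delim) if p)
    return result

print(bracket_split_keep(' ', 'The [ this would break ] example'))
```

Because the decision is based on position in the split result rather than on leading or trailing delimiters, spaces just inside the brackets no longer matter.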

Answer 3

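Use a regular expression that matches either a pair of square brackets, including any enclosed characters, or a sequence of non-whitespace characters. The pattern would be: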

\[.*?\]|\S+

Use it like this:

>>> import re
>>> pattern = r'\[.*?\]|\S+'
>>> s = 'The [quick brown fox] jumps over the [lazy dog.]'
>>> re.findall(pattern, s)
['The', '[quick brown fox]', 'jumps', 'over', 'the', '[lazy dog.]']

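This is a fairly simple approach that ignores possibilities such as nested square brackets. The order of the alternatives in the pattern matters: the bracket alternative has to be tried first, otherwise \S+ would consume '[quick' as an ordinary word.

You can try it out here: https://regex101.com/r/ZizX3q/1

For the CSV example, you could change the pattern to: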
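If nested brackets do need to be handled, one option is a small scanner that tracks bracket depth instead of a regular expression. The sketch below is an illustration rather than part of the answer: split_nested is a hypothetical name, the delimiter is assumed to be a single character, and the brackets are assumed to be balanced.

```python
def split_nested(string, delim=' '):
    parts = []
    current = []
    depth = 0  # number of '[' currently open
    for ch in string:
        if ch == '[':
            depth += 1
        elif ch == ']' and depth > 0:
            depth -= 1
        if ch == delim and depth == 0:
            # delimiter outside all brackets: finish the current part
            if current:
                parts.append(''.join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append(''.join(current))
    return parts

print(split_nested('The [quick [brown fox]] jumps'))
```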

\[.*?\]|[^,]+

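to match either a pair of brackets with their contents or any sequence of non-delimiter characters, the delimiter in this case being a comma: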

>>> pattern = r'\[.*?\]|[^,]+'
>>> s = 'The,[quick,brown,fox],jumps,over,the,[lazy,dog.]'
>>> re.findall(pattern, s)
['The', '[quick,brown,fox]', 'jumps', 'over', 'the', '[lazy,dog.]']

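BTW, I think the expected output for the CSV example is wrong: it drops the commas inside the brackets, e.g. '[quick brown fox]', but I think the commas should be kept.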
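The two patterns above differ only in which character counts as the delimiter, so they can be folded into one helper. This is a sketch under the assumption that the delimiter is a single character; split_outside_brackets is a hypothetical name. Unlike \S+, which excludes all whitespace, this version excludes only the delimiter that is passed in.

```python
import re

def split_outside_brackets(string, delim):
    # Try a bracket pair first, otherwise a run of non-delimiter characters.
    # re.escape keeps the delimiter literal inside the character class.
    pattern = r'\[.*?\]|[^' + re.escape(delim) + r']+'
    return re.findall(pattern, string)

print(split_outside_brackets('The [quick brown fox] jumps over the [lazy dog.]', ' '))
print(split_outside_brackets('The,[quick,brown,fox],jumps,over,the,[lazy,dog.]', ','))
```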

How do I split a string on whitespace while allowing escaped regions that are not split? (Python)

3 answers: