Question

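I want to split a string on single newlines or on groups of spaces. I get the result I want, except for the '' strings. How can I eliminate them?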

Edit: I need the output to keep the whitespace groups and to split on every newline. The only things I don't want are the '' strings.

In [208]: re.split('(\n|\ +)', 'many   fancy word \n\n    hello    \t   hi')
Out[208]: 
['many',
 '   ',
 'fancy',
 ' ',
 'word',
 ' ',
 '',
 '\n',
 '',
 '\n',
 '',
 '    ',
 'hello',
 '    ',
 '\t',
 '   ',
 'hi']

Answer 1

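If the pattern contains a capturing group, the separators are included in the result list.

If you do not use a capturing group, or you replace the capturing group ((...)) with a non-capturing group ((?:...)), the separators are not included.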

# Not using a group at all
>>> re.split('\n|\ +', 'many   fancy word \n\n    hello    \t   hi')
['many', 'fancy', 'word', '', '', '', 'hello', '\t', 'hi']


# Using a non-capturing group
>>> re.split('(?:\n|\ +)', 'many   fancy word \n\n    hello    \t   hi')
['many', 'fancy', 'word', '', '', '', 'hello', '\t', 'hi']

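Quoting the re.split documentation:

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.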
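To make the quoted behaviour concrete, here is a small sketch on a shorter string. The sample text and variable names are made up for illustration, and the pattern is written as a raw string without the backslash before the space, which should match the same thing as the pattern in the question.

```python
import re

text = 'a b\nc'  # illustrative input, not from the question

# Capturing group: the separators show up in the result.
with_seps = re.split(r'(\n| +)', text)

# Non-capturing group: the separators are dropped.
without_seps = re.split(r'(?:\n| +)', text)

# maxsplit=1: only the first split happens; the remainder of the
# string is returned as the last element.
limited = re.split(r'(\n| +)', text, maxsplit=1)

print(with_seps)     # ['a', ' ', 'b', '\n', 'c']
print(without_seps)  # ['a', 'b', 'c']
print(limited)       # ['a', ' ', 'b\nc']
```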

Update following the edit to the question:

You can filter out the empty strings with filter(None, ..):

list(filter(None, re.split('(\n|\ +)', 'many fancy word \n\n hello \t hi')))
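As a rough check of what the filter approach returns on the exact string from the question, something like the following should work. The variable names are only for illustration, and the list comprehension is just an equivalent alternative spelling, not something the original answer proposed.

```python
import re

# The test string from the question, with its runs of spaces intact.
text = 'many   fancy word \n\n    hello    \t   hi'

# Same split as in the question (raw string, no backslash before the space).
parts = re.split(r'(\n| +)', text)

# filter(None, ...) keeps only truthy items, so every '' is dropped.
tokens = list(filter(None, parts))

# Equivalent list comprehension.
same_tokens = [p for p in parts if p]

print(tokens)
# ['many', '   ', 'fancy', ' ', 'word', ' ', '\n', '\n', '    ',
#  'hello', '    ', '\t', '   ', 'hi']
```

Since only empty strings are removed, joining the tokens back together reproduces the original string.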

Or use re.findall with a modified pattern:

re.findall('\n|\ +|[^\n ]+', 'many fancy word \n\n hello \t hi')
# `[^\n ]+` matches a run of characters that are neither newlines nor spaces.
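A quick way to convince yourself that the findall version gives the same tokens as split-then-filter is to compare the two on the question's string. This is only a sketch; the variable names are invented here.

```python
import re

text = 'many   fancy word \n\n    hello    \t   hi'

# Each match is either one newline, a run of spaces, or a run of
# characters that are neither newlines nor spaces.
found = re.findall(r'\n| +|[^\n ]+', text)

# Reference result: split with a capturing group, then drop the ''.
filtered = list(filter(None, re.split(r'(\n| +)', text)))

print(found == filtered)  # True
```

Because the pattern has no capturing groups, re.findall returns the full text of each match, and the three alternatives together cover every character, so nothing from the input is lost.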
