Splitting on a longer regex pattern without losing characters (Python 3+)

Question

My program needs to split natural-language text into sentences. I made a mock sentence splitter using re.split in Python 3+. It looks like this:

re.split('\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)

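When the pattern occurs, I need to split the sentence at the whitespace. However, the code, as it should, splits the text where the pattern occurs rather than at the whitespace, so it does not keep the last characters of the sentence, including the sentence terminator.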

“Is this the number 3? The text goes on...”

will look like

“Is this the number” and “he text goes on...”

Is there a way to specify where the data should be split while keeping the pattern, or do I have to look for an alternative?

Answer 1

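As @jonrsharpe says, you can use lookaround to reduce the number of characters that are split away, for instance to a single one. If you do not mind losing the space characters, you could use something like: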

>>> re.split(r'\s(?=[A-Z])', content)
['Is this the number 3?', 'The text goes on...']

Here you split on a whitespace character that is followed by an uppercase letter. The T is not consumed, only the space.
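
As a minimal sketch of the same idea, a lookbehind can be combined with the lookahead so that the split happens only on whitespace sitting between a sentence terminator and an uppercase letter. The helper name `split_sentences` and the pattern are illustrative assumptions, not the asker's full rule set (for example, the special handling of digits before a period is not reproduced):

```python
import re

# Hypothetical pattern: whitespace preceded by . ! or ? and followed by A-Z.
# The lookbehind and lookahead only inspect the neighbouring characters;
# the whitespace run is the only text that is actually consumed.
SENTENCE_GAP = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

def split_sentences(text):
    """Split text into sentences, keeping terminators and initial capitals."""
    return SENTENCE_GAP.split(text)

content = 'Is this the number 3? The text goes on...'
print(split_sentences(content))
# ['Is this the number 3?', 'The text goes on...']
```

Because Python's lookbehind must have a fixed width, the terminator check is kept to a single character class here.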

Alternative approach: alternating split/captured items

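You can, however, use another approach. When you split, you eat content, but you can use the same regex to generate a list of matches. These matches are the data that sat in between the split items. By merging the matches in between the split items, you can reconstruct the full list: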

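import re

def nonconsumesplit(regex, content):
    # Pieces between the matches (the matched text itself is dropped).
    outer = re.split(regex, content)
    # The matched separators, padded with '' so the last piece has a partner.
    inner = re.findall(regex, content) + ['']
    # Interleave pieces and separators; the built-in zip suffices in Python 3.
    return [val for pair in zip(outer, inner) for val in pair]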

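This gives the following. (The output shown is from the original answer; since Python 3.7, re.split also splits on empty matches such as the one produced by $, so newer versions may return extra empty strings at the end.)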

>>> nonconsumesplit(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
['Is this the number ', '3? ', 'The text goes on...', '']
>>> nonconsumesplit(r'\s', content)
['Is', ' ', 'this', ' ', 'the', ' ', 'number', ' ', '3?', ' ', 'The', ' ', 'text', ' ', 'goes', ' ', 'on...', '']

Or you can use string concatenation, so that each separator stays attached to the piece before it:

def nonconsumesplitconcat(regex, content):
    outer = re.split(regex, content)
    inner = re.findall(regex, content) + ['']
    # Glue each piece to the separator that followed it.
    return [piece + sep for piece, sep in zip(outer, inner)]

This gives the following (on Python 3.7+, the $ alternative may add a trailing empty string):

>>> nonconsumesplitconcat(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
['Is this the number 3? ', 'The text goes on...']
>>> nonconsumesplitconcat(r'\s', content)
['Is ', 'this ', 'the ', 'number ', '3? ', 'The ', 'text ', 'goes ', 'on...']
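
A related option, not used in the answer above, is that re.split itself returns the separators when the pattern is wrapped in a capturing group, which avoids running the regex twice. The sketch below assumes the pattern has no capturing groups of its own and never matches the empty string; `split_keeping_separators` is a hypothetical name:

```python
import re

def split_keeping_separators(pattern, text):
    """Split text on pattern, attaching each separator to the piece before it.

    Assumes pattern contains no capturing groups and no empty matches.
    """
    # With a capturing group, re.split returns: piece, sep, piece, sep, ..., piece
    parts = re.split(f'({pattern})', text)
    pieces = parts[0::2]
    # Pad with '' because the final piece has no separator after it.
    seps = parts[1::2] + ['']
    return [piece + sep for piece, sep in zip(pieces, seps)]

content = 'Is this the number 3? The text goes on...'
print(split_keeping_separators(r'\d[!?]\s', content))
# ['Is this the number 3? ', 'The text goes on...']
```

If the pattern already contains groups, converting them to non-capturing groups, written (?:...), would be needed first, since every capturing group adds an item to the list returned by re.split.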
