I'm trying to split English sentences properly, and I came up with the following god-awful regex:
(?<!\d|([A-Z]\.)|(\.[a-z]\.)|(\.\.\.)|etc\.|[Pp]rof\.|[Dd]r\.|[Mm]rs\.|[Mm]s\.|[Mm]z\.|[Mm]me\.)(?<=([\.!?])|(?<=([\.!?][\'\"])))[\s]+?(?=[\S])'
The problem is, Python keeps raising the following error:
Traceback (most recent call last):
File "", line 1, in
File "sp.py", line 55, in analyze
self.sentences = re.split(god_awful_regex, self.inputstr.strip())
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 165, in split
return _compile(pattern, 0).split(string, maxsplit)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 243, in _compile
raise error, v # invalid expression
sre_constants.error: look-behind requires fixed-width pattern
Why isn't this a valid fixed-width regex? I'm not using any repetition characters (* or +), just alternation (|).
EDIT: Modifying it per @Anomie solved that problem - many thanks! Unfortunately, I can't get the new expression balanced:
(?<!(\d))(?<![A-Z]\.)(?<!\.[a-z]\.)(?<!(\.\.\.))(?<!etc\.)(?<![Pp]rof\.)(?<![Dd]r\.)(?<![Mm]rs\.)(?<![Mm]s\.)(?<![Mm]z\.)(?<![Mm]me\.)(?:(?<=[\.!?])|(?<=[\.!?][\'\"\]))[\s]+?(?=[\S])
is what I have now. The number of ('s matches the number of )'s, but:
>>> god_awful_regex = r'''(?<!(\d))(?<![A-Z]\.)(?<!\.[a-z]\.)(?<!(\.\.\.))(?<!etc\.)(?<![Pp]rof\.)(?<![Dd]r\.)(?<![Mm]rs\.)(?<![Mm]s\.)(?<![Mm]z\.)(?<![Mm]me\.)(?:(?<=[\.!?])|(?<=[\.!?][\'\"\]))[\s]+?(?=[\S])'''
>>> god_awful_regex.count('(')
17
>>> god_awful_regex.count(')')
17
>>> god_awful_regex.count('[')
13
>>> god_awful_regex.count(']')
13
Any other ideas?
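Since counting brackets by hand doesn't reveal anything, one way to narrow this down is to compile each fragment of the pattern separately and see which one the engine rejects. Below is a small diagnostic sketch (not from the original post; it assumes the standard re module), with the fragments copied verbatim from the expression above:

import re

# Compile each fragment of the pattern on its own to find the one the
# engine rejects; the fragments are copied from the regex above.
fragments = [
    r'(?<!(\d))', r'(?<![A-Z]\.)', r'(?<!\.[a-z]\.)', r'(?<!(\.\.\.))',
    r'(?<!etc\.)', r'(?<![Pp]rof\.)', r'(?<![Dd]r\.)', r'(?<![Mm]rs\.)',
    r'(?<![Mm]s\.)', r'(?<![Mm]z\.)', r'(?<![Mm]me\.)',
    r'''(?:(?<=[\.!?])|(?<=[\.!?][\'\"\]))''',   # note the \] inside the class
    r'[\s]+?(?=[\S])',
]
for fragment in fragments:
    try:
        re.compile(fragment)
    except re.error as e:
        print('%r does not compile: %s' % (fragment, e))

If this reproduces the problem, the fragment containing [\'\"\] is the culprit: the \] is an escaped literal ], so the character class is never closed, and the closing parentheses after it are swallowed by the class - which is why the parentheses look balanced when counted but not to the engine.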
Answer 0 (score: 11)
Consider this subexpression:
(?<=([\.!?])|(?<=([\.!?][\'\"])))
The left side of the | is one character wide, while the right side is zero. You have the same problem in your larger negative lookbehind: depending on the branch, it could be 1, 2, 3, 4, or 5 characters wide.
Logically, a negative lookbehind like (?<!A|B|C) should be equivalent to a series of lookbehinds (?<!A)(?<!B)(?<!C). A positive lookbehind (?<=A|B|C) should be equivalent to (?:(?<=A)|(?<=B)|(?<=C)).
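A short sketch of that decomposition (not from the original answer; it keeps only a few of the abbreviation branches for brevity and assumes the standard re module):

import re

# A single lookbehind whose alternatives have different widths is rejected...
try:
    re.compile(r'(?<!etc\.|\d)')
except re.error as e:
    print('mixed-width lookbehind: %s' % e)

# ...but chaining one fixed-width lookbehind per alternative compiles and works.
splitter = re.compile(
    r'(?<!\d)(?<![A-Z]\.)(?<!etc\.)(?<![Dd]r\.)(?<![Mm]rs\.)'  # a few negative lookbehinds
    r'''(?:(?<=[.!?])|(?<=[.!?]['"]))'''                       # one positive lookbehind per width
    r'\s+(?=\S)'                                               # the whitespace to split on
)
text = 'Dr. Smith arrived at 9. He said "Hello!" Then he left. The end.'
print(splitter.split(text))
# ['Dr. Smith arrived at 9.', 'He said "Hello!"', 'Then he left.', 'The end.']

With the alternation unrolled this way, every individual lookbehind has exactly one width, which is all the re module requires.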
Answer 1 (score: 0)
This doesn't answer your question. However, if you want to split text into sentences, you might want to look at nltk, which among many other things includes a PunktSentenceTokenizer. Here is the tokenizer in action:
""" PunktSentenceTokenizer
A sentence tokenizer which uses an unsupervised algorithm to build a model
for abbreviation words, collocations, and words that start sentences; and then
uses that model to find sentence boundaries. This approach has been shown to
work well for many European languages. """
from nltk.tokenize.punkt import PunktSentenceTokenizer
tokenizer = PunktSentenceTokenizer()
print tokenizer.tokenize(__doc__)
# [' PunktSentenceTokenizer\n\nA sentence tokenizer which uses an unsupervised
# algorithm to build a model\nfor abbreviation words, collocations, and words
# that start sentences; and then\nuses that model to find sentence boundaries.',
# 'This approach has been shown to\nwork well for many European languages. ']
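As a usage note (ours, not part of the original answer): nltk also provides a pre-trained English Punkt model through nltk.sent_tokenize, which already knows many of the abbreviations the regex above has to special-case; it assumes the 'punkt' data has been downloaded once:

import nltk

# Requires the pre-trained Punkt model; download it once with:
#   nltk.download('punkt')
sample = 'Dr. Smith visited Mrs. Jones. She was not at home.'
print(nltk.sent_tokenize(sample))
# Expected (roughly): ['Dr. Smith visited Mrs. Jones.', 'She was not at home.']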
Answer 2 (score: -1)
It looks like you may be using a repetition character near the end:
[\s]+?
Unless I'm reading that wrong.
UPDATE:
Or perhaps the vertical bar, as someone else here pointed out - the first answer to this question seems to confirm it: determine if regular expression only matches fixed-length strings