Question

I have a string, and I'm trying to build a regex mask that shows N words at a given offset. Suppose I have the following string:

"The quick, brown fox jumps over the lazy dog."

I want to display 3 words at a time:

offset 0: "The quick, brown"
offset 1: "quick, brown fox"
offset 2: "brown fox jumps"
offset 3: "fox jumps over"
offset 4: "jumps over the"
offset 5: "over the lazy"
offset 6: "the lazy dog."

I'm using Python, and I've been using the following simple regex to detect 3 words:

>>> import re
>>> s = "The quick, brown fox jumps over the lazy dog."
>>> re.search(r'(\w+\W*){3}', s).group()
'The quick, brown '

But I can't figure out how to build a kind of mask that shows the next 3 words rather than the first ones. I need to keep the punctuation.

Answer 1

Prefix matching option

You can do this with a variable-prefix regex that skips the first offset words and captures the word triplet into a group.

Something like this:

import re
s = "The quick, brown fox jumps over the lazy dog."

# The first quantifier ({0}, {1}, {2}) is the offset: the number of words to skip.
# Each result keeps the separator that follows its last word.
print(re.search(r'(?:\w+\W*){0}((?:\w+\W*){3})', s).group(1))
# The quick, brown
print(re.search(r'(?:\w+\W*){1}((?:\w+\W*){3})', s).group(1))
# quick, brown fox
print(re.search(r'(?:\w+\W*){2}((?:\w+\W*){3})', s).group(1))
# brown fox jumps

Let's take a look at the pattern:

 _"word"_      _"word"_
/        \    /        \
(?:\w+\W*){2}((?:\w+\W*){3})
             \_____________/
                group 1

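This does what it says: match 2 words, then match 3 words, capturing them into group 1.

The (?:...) constructs are used for grouping for the repetition, but they are non-capturing.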
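Rather than typing the offset into the pattern by hand, the prefix can be generated. The following is only a minimal sketch of that idea: the helper name words_at and the WORD constant are made up for illustration, and it still uses the \w+\W* "word" pattern from above.

```python
import re

WORD = r'\w+\W*'  # the "word" pattern used in this answer (illustrative constant)

def words_at(s, offset, n=3):
    """Skip `offset` words, then return the next `n` words (hypothetical helper)."""
    # Non-capturing prefix skips `offset` words; group 1 captures the next `n` words.
    pattern = r'(?:%s){%d}((?:%s){%d})' % (WORD, offset, WORD, n)
    match = re.search(pattern, s)
    return match.group(1) if match else None

s = "The quick, brown fox jumps over the lazy dog."
for offset in range(7):
    # repr() makes the trailing separator visible
    print(offset, repr(words_at(s, offset)))
```

Only offsets 0 through 6 are meaningful for this sentence; with this loose "word" pattern, larger offsets may still produce a match by splitting words apart.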

References

regular-expressions.info/Capturing Groups, Non-capturing Groups
- Repeating a Capturing Group vs Capturing a Repeated Group

Note on the "word" pattern

It should be said that \w+\W* is a poor choice for a "word" pattern, as the following example shows:

import re
s = "nothing"
# A single word, yet the 3-"word" pattern still matches all of it
print(re.search(r'(\w+\W*){3}', s).group())
# nothing

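There aren't 3 words, but the regex matches anyway: because \W* can match the empty string, backtracking lets \w+ split "nothing" into three shorter pieces.

Perhaps a better pattern is: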

\w+(?:\W+|$)

That is, \w+ followed by either \W+ or the end of the string $.
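A quick comparison of the two patterns, as a sketch (the variable names loose and strict are just labels chosen here):

```python
import re

loose = r'(\w+\W*){3}'            # allows an empty separator between "words"
strict = r'(?:\w+(?:\W+|$)){3}'   # requires a separator or the end of the string

# The loose pattern "finds" 3 words in a single word; the strict one does not.
print(re.search(loose, "nothing") is not None)
print(re.search(strict, "nothing") is not None)

# On real input the strict pattern still picks up the first three words.
print(repr(re.search(strict, "The quick, brown fox").group()))
```

The first check prints True and the second prints False, since the strict pattern cannot break "nothing" into pieces without a separator between them.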

Capturing lookahead option

As Kobi suggested in the comments, this option is simpler in that you only have one static pattern. It uses findall to capture all the matches:

import re
s = "The quick, brown fox jumps over the lazy dog."

# One match per word start; group 1 holds the 3 "words" seen by the lookahead
triplets = re.findall(r"\b(?=((?:\w+(?:\W+|$)){3}))", s)

print(triplets)
# ['The quick, brown ', 'quick, brown fox ', 'brown fox jumps ',
#  'fox jumps over ', 'jumps over the ', 'over the lazy ', 'the lazy dog.']

# Index into the list to get the triplet at a given offset
print(triplets[3])
# fox jumps over

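Here is how it works: it matches at a zero-width word boundary \b, then uses a lookahead to capture 3 "words" in group 1. Because the lookahead consumes no characters, the matches can overlap.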

    ______lookahead______
   /      ___"word"__    \
  /      /           \    \
\b(?=((?:\w+(?:\W+|$)){3}))
     \___________________/
           group 1
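The same lookahead pattern can be parameterized on the window size. This is a rough sketch only: the function name word_windows is hypothetical, and stripping the trailing whitespace is an extra step added here so the results match the output asked for in the question.

```python
import re

def word_windows(s, n=3):
    """Return every run of `n` consecutive words in `s` (hypothetical helper)."""
    # Same pattern as above, with the repeat count filled in from `n`.
    pattern = r'\b(?=((?:\w+(?:\W+|$)){%d}))' % n
    # rstrip() drops only trailing whitespace, so punctuation is kept.
    return [window.rstrip() for window in re.findall(pattern, s)]

s = "The quick, brown fox jumps over the lazy dog."
windows = word_windows(s)
print(len(windows))
print(windows[3])
print(word_windows(s, n=2)[0])
```

Indexing the returned list by offset gives the mask for that offset, at the cost of computing all windows up front.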

References

regular-expressions.info/Lookarounds

Answer 2

One approach is to split the string and select slices:

import re

s = "The quick, brown fox jumps over the lazy dog."
words = re.split(r"\s+", s)
for i in range(len(words) - 2):
    print(' '.join(words[i:i+3]))

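Of course, this does assume that you either have only single spaces between words, or don't care if every run of whitespace is collapsed into a single space.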
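If the original spacing does matter, one possible variation (not part of the approach above, just a sketch) is to record where each word starts and ends and slice the original string instead of re-joining. The function name windows_preserving_spacing is made up for this example, and "word" here means any run of non-whitespace characters.

```python
import re

def windows_preserving_spacing(s, n=3):
    """Slice `s` itself so whitespace between words is kept as-is (hypothetical helper)."""
    # (start, end) positions of every run of non-whitespace characters
    spans = [m.span() for m in re.finditer(r'\S+', s)]
    # Each window runs from the start of word i to the end of word i + n - 1.
    return [s[spans[i][0]:spans[i + n - 1][1]]
            for i in range(len(spans) - n + 1)]

s = "The  quick,\tbrown fox"
print(windows_preserving_spacing(s))
```

With the double space and the tab in this input, the two windows come back with that whitespace intact.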

Answer 3

No regex needed

>>> s = "The quick, brown fox jumps over the lazy dog."
>>> for offset in range(7):
...     print('offset {0}: "{1}"'.format(offset, ' '.join(s.split()[offset:offset + 3])))
... 
offset 0: "The quick, brown"
offset 1: "quick, brown fox"
offset 2: "brown fox jumps"
offset 3: "fox jumps over"
offset 4: "jumps over the"
offset 5: "over the lazy"
offset 6: "the lazy dog."

Answer 4

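We have two orthogonal problems here:

1. How to split the string.
2. How to build groups of 3 consecutive elements.

For 1, you can use a regex or, as others have pointed out, a simple str.split is enough. For 2, notice that what you want looks very similar to the pairwise abstraction in the itertools recipes: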

http://docs.python.org/library/itertools.html#recipes

So we write a generalized n-wise function:

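import itertools

def nwise(iterable, n):
    """nwise(iter([1,2,3,4,5]), 3) -> (1,2,3), (2,3,4), (3,4,5)"""
    # n independent copies of the iterator; copy number idx skips its first idx items
    iterables = itertools.tee(iterable, n)
    slices = (itertools.islice(it, idx, None) for (idx, it) in enumerate(iterables))
    # Python 3: the built-in zip is lazy (it replaces Python 2's itertools.izip)
    return zip(*slices)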

We end up with simple, modular code:

>>> s = "The quick, brown fox jumps over the lazy dog."
>>> list(nwise(s.split(), 3))
[('The', 'quick,', 'brown'), ('quick,', 'brown', 'fox'), ('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog.')]

Or, as you requested:

>>> # also: list(map(" ".join, nwise(s.split(), 3)))
>>> [" ".join(words) for words in nwise(s.split(), 3)]
['The quick, brown', 'quick, brown fox', 'brown fox jumps', 'fox jumps over', 'jumps over the', 'over the lazy', 'the lazy dog.']
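To get the window at a single offset without building the whole list, the lazy iterator can be advanced with itertools.islice. This is a sketch assuming Python 3, where the built-in zip is lazy; window_at is a hypothetical name, and nwise is repeated here only so the example runs on its own.

```python
import itertools

def nwise(iterable, n):
    """Yield overlapping n-tuples: (1,2,3), (2,3,4), (3,4,5), ..."""
    iterables = itertools.tee(iterable, n)
    slices = (itertools.islice(it, idx, None) for idx, it in enumerate(iterables))
    return zip(*slices)

def window_at(s, offset, n=3):
    """Return the n words starting at word `offset`, or None if out of range."""
    windows = nwise(s.split(), n)
    # Skip `offset` windows and take the next one, if there is one.
    words = next(itertools.islice(windows, offset, None), None)
    return " ".join(words) if words is not None else None

s = "The quick, brown fox jumps over the lazy dog."
print(window_at(s, 4))
print(window_at(s, 7))
```

Returning None for an out-of-range offset is a choice made for this sketch; raising an exception would work equally well.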

String mask and offset with regex
