Regex: select all groups of two (hashtag) terms that are next to each other

Time: 2014-10-22 06:07:03

Tags: python regex

I have a sample string:

#water #atlantic ocean #sea

I would like to use a regular expression to select every group of two hashtag terms that sit next to each other. It should return:

[[['#water']['#atlantic ocean']], [['#atlantic ocean']['#sea']]]

I don't know how to write this regex. The closest I have come is: ([#][A-Za-z\s]+\s?)

which only yields the following (in Python):

>>> regex.findall(string)
[u'#water ', u'#atlantic ocean ', u'#sea']

I have tried adding a {2} at the end, but that does not seem to match the pairs. Any ideas on how to achieve this?

3 answers:

Answer 0: (score: 2)

To me, splitting on # (or on a space followed by a hash) feels more intuitive than using a complicated regex:

import re

expr = "#water #atlantic ocean #sea"
# Split on '#' (optionally preceded by a space) and drop the empty first piece.
groups = [g for g in re.split(r' ?#', expr) if g]
# Another option is to use a split that doesn't require regex at all:
# groups = [g.strip() for g in expr.split("#") if g.strip()]
res = []
# Pair each tag with its right-hand neighbour; the last tag has none.
for i, itm in enumerate(groups[:-1]):
    res.append(["#" + itm, "#" + groups[i + 1]])

print(res)  # [['#water', '#atlantic ocean'], ['#atlantic ocean', '#sea']]
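
The same split-and-pair idea can also be sketched for Python 3 with zip, which pairs each tag with the one that follows it. The helper name adjacent_hashtag_pairs is made up for illustration and is not part of the original answer; it assumes every tag starts with a # and that whitespace around a tag is not significant.

```python
def adjacent_hashtag_pairs(text):
    """Return consecutive hashtag pairs from text (hypothetical helper)."""
    # Split on '#', strip whitespace, and skip empty pieces
    # (the piece before the first '#' is empty).
    tags = ["#" + part.strip() for part in text.split("#") if part.strip()]
    # zip(tags, tags[1:]) pairs each tag with its right-hand neighbour
    # and stops at the last tag, which has no neighbour.
    return [list(pair) for pair in zip(tags, tags[1:])]


pairs = adjacent_hashtag_pairs("#water #atlantic ocean #sea")
print(pairs)  # [['#water', '#atlantic ocean'], ['#atlantic ocean', '#sea']]
```

A string with fewer than two tags simply gives an empty list, because zip has nothing to pair.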

Answer 1: (score: 1)

You need to use a positive lookahead in order to get overlapping matches.

(?=(#[A-Za-z]+(?:\s[A-Za-z]+)?\s#[A-Za-z]+(?:\s[A-Za-z]+)?))

>>> import re
>>> s = "#water #atlantic ocean #sea"
>>> m = re.findall(r'(?=(#[A-Za-z]+(?:\s[A-Za-z]+)?\s#[A-Za-z]+(?:\s[A-Za-z]+)?))', s)
>>> print(m)
['#water #atlantic ocean', '#atlantic ocean #sea']

Or, to capture the two tags of each pair separately:

>>> m = re.findall(r'(?=(#[A-Za-z]+(?:\s[A-Za-z]+)?)\s(#[A-Za-z]+(?:\s[A-Za-z]+)?))', s)
>>> print(m)
[('#water', '#atlantic ocean'), ('#atlantic ocean', '#sea')]

If a hashtag may be followed by zero or more additional words, use * instead of ? after the non-capturing group:

>>> m = re.findall(r'(?=(#[A-Za-z]+(?:\s[A-Za-z]+)*)\s(#[A-Za-z]+(?:\s[A-Za-z]+)*))', s)
>>> print(m)
[('#water', '#atlantic ocean'), ('#atlantic ocean', '#sea')]
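
To illustrate why the lookahead matters, here is a small Python 3 sketch that runs the same pair pattern with and without the (?=...) wrapper. The variable names tag, consuming and overlapping are chosen here for illustration only.

```python
import re

s = "#water #atlantic ocean #sea"
# One hashtag: '#', a word, then any number of further space-separated words.
tag = r"#[A-Za-z]+(?:\s[A-Za-z]+)*"

# Without a lookahead, the first match consumes "#atlantic ocean",
# so no later match can start at that tag.
consuming = re.findall(rf"({tag})\s({tag})", s)

# Inside (?=...), the match has zero width: nothing is consumed, the scan
# resumes one character later, and the overlapping pair is found too.
overlapping = re.findall(rf"(?=({tag})\s({tag}))", s)

print(consuming)    # [('#water', '#atlantic ocean')]
print(overlapping)  # [('#water', '#atlantic ocean'), ('#atlantic ocean', '#sea')]
```

This is also why appending {2} to a consuming pattern does not help: each tag can take part in only one match.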

Answer 2: (score: 0)

(#[^#]*)(?=[^#]*(#[^#]*))

Try this. It gives the required groups. Grab the captures.

import re
x = "#water #atlantic ocean #sea"
print(re.findall(r"(#[^#]*)(?=[^#]*(#[^#]*))", x))

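Output: [('#water ', '#atlantic ocean '), ('#atlantic ocean ', '#sea')]

Note that [^#]* also consumes the space before the next #, so every capture except the final '#sea' keeps a trailing space; call strip() on each item if that matters.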

See the demo.

http://regex101.com/r/rQ6mK9/36
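
Because [^#]* matches everything up to the next #, including the separating space, the raw captures from this pattern carry trailing whitespace. A minimal Python 3 sketch of cleaning them up follows; the names raw and pairs are illustrative and not part of the original answer.

```python
import re

x = "#water #atlantic ocean #sea"

# Group 1 consumes one tag (plus the space after it); the lookahead
# captures the following tag without consuming it.
raw = re.findall(r"(#[^#]*)(?=[^#]*(#[^#]*))", x)

# Strip the whitespace that [^#]* picked up before the next '#'.
pairs = [(first.strip(), second.strip()) for first, second in raw]

print(raw)    # [('#water ', '#atlantic ocean '), ('#atlantic ocean ', '#sea')]
print(pairs)  # [('#water', '#atlantic ocean'), ('#atlantic ocean', '#sea')]
```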