I have a sample string:
#water #atlantic ocean #sea
I want to use a regular expression to select every group of two hashtags that sit next to each other. It should return:
[[['#water']['#atlantic ocean']], [['#atlantic ocean']['#sea']]]
I don't know how to write this regex. The closest I have come is: ([#][A-Za-z\s]+\s?)
which only produces the following (in Python):
>>> regex.findall(string)
[u'#water ', u'#atlantic ocean ', u'#sea']
I have tried adding {2} at the end, but that does not seem to match the pairs. Any ideas on how to achieve this?
Answer 0 (score: 2)
To me, splitting on # (or on a space followed by a hash) feels more intuitive than using a complex regex:
import re
expr = "#water #atlantic ocean #sea"
# Split on '#' (optionally preceded by a space) and drop the empty leading piece.
groups = list(filter(None, re.split(r' ?#', expr)))
# another option is to use a split that doesn't require regex at all:
# groups = list(filter(None, map(str.strip, expr.split("#"))))
res = []
for i, itm in enumerate(groups):
    # Pair each tag with the one that follows it; the last tag has no successor.
    if i < len(groups) - 1:
        res.append(["#" + itm, "#" + groups[i + 1]])
print(res)  # [['#water', '#atlantic ocean'], ['#atlantic ocean', '#sea']]
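The same pairing can be written without index arithmetic by zipping the tag list with a copy of itself shifted by one. This is only a minimal sketch, assuming Python 3; the helper name adjacent_tag_pairs is made up for illustration and is not part of the answer above:

```python
def adjacent_tag_pairs(text):
    """Return [tag, next_tag] for every pair of neighbouring hashtags."""
    # Split on '#', trim surrounding whitespace, and drop empty pieces.
    tags = ["#" + part.strip() for part in text.split("#") if part.strip()]
    # zip(tags, tags[1:]) walks the list as overlapping pairs.
    return [[a, b] for a, b in zip(tags, tags[1:])]


pairs = adjacent_tag_pairs("#water #atlantic ocean #sea")
print(pairs)  # [['#water', '#atlantic ocean'], ['#atlantic ocean', '#sea']]
```

A string with fewer than two tags simply yields an empty list, so no bounds check is needed.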
Answer 1 (score: 1)
You need to use a positive lookahead in order to get overlapping matches.
(?=(#[A-Za-z]+(?:\s[A-Za-z]+)?\s#[A-Za-z]+(?:\s[A-Za-z]+)?))
>>> import re
>>> s = "#water #atlantic ocean #sea"
>>> m = re.findall(r'(?=(#[A-Za-z]+(?:\s[A-Za-z]+)?\s#[A-Za-z]+(?:\s[A-Za-z]+)?))', s)
>>> print(m)
['#water #atlantic ocean', '#atlantic ocean #sea']
OR
>>> m = re.findall(r'(?=(#[A-Za-z]+(?:\s[A-Za-z]+)?)\s(#[A-Za-z]+(?:\s[A-Za-z]+)?))', s)
>>> print(m)
[('#water', '#atlantic ocean'), ('#atlantic ocean', '#sea')]
If a tag may be followed by zero or more additional words, use * instead of ? after the non-capturing group.
>>> m = re.findall(r'(?=(#[A-Za-z]+(?:\s[A-Za-z]+)*)\s(#[A-Za-z]+(?:\s[A-Za-z]+)*))', s)
>>> print(m)
[('#water', '#atlantic ocean'), ('#atlantic ocean', '#sea')]
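To see why the lookahead matters, it may help to compare it with the same pattern written as an ordinary consuming match. The sketch below assumes Python 3; the variable names tag, consuming and overlapping are introduced here only for illustration:

```python
import re

s = "#water #atlantic ocean #sea"
# One hashtag: '#', a word, then any number of further space-separated words.
tag = r"#[A-Za-z]+(?:\s[A-Za-z]+)*"

# Consuming match: the second tag is used up by the first match,
# so it cannot also start the next pair.
consuming = re.findall(rf"({tag})\s({tag})", s)

# Lookahead match: zero-width, so the scan resumes one character after the
# start of the previous match and the second tag can be reused.
overlapping = re.findall(rf"(?=({tag})\s({tag}))", s)

print(consuming)    # [('#water', '#atlantic ocean')]
print(overlapping)  # [('#water', '#atlantic ocean'), ('#atlantic ocean', '#sea')]
```

This is also why appending {2} to a plain group does not help: the regex engine never returns two matches that share the same characters unless the shared part sits inside a zero-width assertion.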
Answer 2 (score: 0)
(#[^#]*)(?=[^#]*(#[^#]*))
Try this. It gives the desired groups. Grab the captures.
import re
x = "#water #atlantic ocean #sea"
print(re.findall(r"(#[^#]*)(?=[^#]*(#[^#]*))", x))
Output: [('#water ', '#atlantic ocean '), ('#atlantic ocean ', '#sea')] (each capture keeps the space before the next #; strip it if needed)
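Because [^#]* also matches spaces, each capture from this pattern includes the whitespace that precedes the next #. A small sketch, assuming Python 3, that trims the captures afterwards (the names raw and cleaned are illustrative only):

```python
import re

x = "#water #atlantic ocean #sea"

# Group 1 is the current tag; group 2, captured inside the lookahead,
# is the tag that follows it.
raw = re.findall(r"(#[^#]*)(?=[^#]*(#[^#]*))", x)

# Remove the trailing space that [^#]* picked up before the next '#'.
cleaned = [(first.strip(), second.strip()) for first, second in raw]

print(cleaned)  # [('#water', '#atlantic ocean'), ('#atlantic ocean', '#sea')]
```

Unlike the letter-based patterns in the other answer, [^#]* accepts any character except #, so tags containing digits or punctuation are matched as well.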