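Find the two words that appear before a string index using Python

Asked: 2019-07-12 14:03:48

Tags: python regex string

Given a text, I want to find the words that appear before the word unknown.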

import re

text = "the women marathon unknown introduced at the summer olympics los angeles usa and unknown won"
items = re.finditer('unknown', text)  # there are 2 occurrences of "unknown"
for i in items:
    print(i.start())  # start index of each "unknown"

The output is:

19 
81

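Now, how do I extract the two words that appear before each unknown?
For the first unknown I should get: women marathon
For the second unknown I should get: usa and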

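3 Answers:

Answer 0: (score: 1)

This expression might be close to what is needed here. It captures everything that precedes each unknown, not only the last two words: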

([\s\S]*?)(\bunknown\b)

Testing with re.findall

import re

regex = r"([\s\S]*?)(unknown)"

test_str = "the women marathon unknown introduced at the summer olympics los angeles usa and unknown won"

print(re.findall(regex, test_str, re.MULTILINE))

Testing with re.finditer

import re

regex = r"([\s\S]*?)(unknown)"

test_str = "the women marathon unknown introduced at the summer olympics los angeles usa and unknown won"

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

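The expression is explained in the top-right panel of this demo, if you wish to explore, simplify, or modify it. In this link, you can watch how it matches against some sample inputs step by step, if you like.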
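Because group 1 of that expression holds all the text before each unknown, the two preceding words still have to be picked out of it. A minimal sketch, assuming it is acceptable to split the captured group on whitespace and keep its last two items (the names `pattern` and `preceding` are illustrative, not part of the answer):

```python
import re

text = "the women marathon unknown introduced at the summer olympics los angeles usa and unknown won"

# Group 1 holds everything between the previous match (or the start
# of the string) and the current whole-word "unknown".
pattern = re.compile(r"([\s\S]*?)\bunknown\b")

# Split each captured chunk on whitespace and keep its last two words.
preceding = [m.group(1).split()[-2:] for m in pattern.finditer(text)]
print(preceding)  # [['women', 'marathon'], ['usa', 'and']]
```

If fewer than two words separate two occurrences of unknown, the slice simply returns whatever words are available.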

Answer 1: (score: 1)

Short approach:

import re

text = "the women marathon unknown introduced at the summer olympics los angeles usa and unknown won"
matches = re.finditer(r'(\S+\s+){2}(?=unknown)', text)  # raw string avoids invalid-escape warnings
for m in matches:
   print(m.group())

Output:

women marathon 
usa and 
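The lookahead above would also fire before a longer word such as unknownish, and the number of words is hard-coded. A possible generalization is sketched below; `words_before` and its parameters are hypothetical names introduced for illustration, and the trailing `\b` assumes the target should match only as a whole word:

```python
import re

def words_before(text, target, n=2):
    """Return, for each whole-word occurrence of target, the n words before it."""
    # (?:\S+\s+){n} consumes n whitespace-separated words; the lookahead
    # checks that target follows without consuming it.
    pattern = re.compile(r"((?:\S+\s+){%d})(?=%s\b)" % (n, re.escape(target)))
    return [m.group(1).split() for m in pattern.finditer(text)]

text = "the women marathon unknown introduced at the summer olympics los angeles usa and unknown won"
print(words_before(text, "unknown"))     # [['women', 'marathon'], ['usa', 'and']]
print(words_before(text, "unknown", 1))  # [['marathon'], ['and']]
```

Returning lists of words rather than the raw match also drops the trailing space visible in the output above.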

Answer 2: (score: 1)

A version without re, using itertools.groupby (doc):

from itertools import groupby

text="the women marathon unknown introduced at the summer olympics los angeles usa and unknown won"

for v, g in groupby(text.split(), lambda k: k=='unknown'):
    if v:
        continue
    l = [*g]
    if len(l) > 1:
        print(l[-2:])

Prints:

['women', 'marathon']
['usa', 'and']
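Since the question already obtains each start index with re.finditer, another option is to slice the text up to that index and keep the last two words. This is only a sketch under that assumption, and the `results` list is an illustrative name:

```python
import re

text = "the women marathon unknown introduced at the summer olympics los angeles usa and unknown won"

results = []
for m in re.finditer(r"\bunknown\b", text):
    # Everything before the match index, split into words; keep the last two.
    before = text[:m.start()].split()[-2:]
    results.append((m.start(), before))

for index, words in results:
    print(index, words)
# 19 ['women', 'marathon']
# 81 ['usa', 'and']
```

Note that the slice runs from the start of the string, so if two occurrences of unknown were less than two words apart, the earlier unknown itself would show up among the preceding words.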