Question

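I am trying to extract the words between two phrases. For example, suppose I have the following paragraph:

They had porridge for breakfast one day, and they went for a walk in the woods while the porridge was cooling while they were walking a little girl walked into the house this little girl had golden curls that fell down to her waist, and everyone called her Goldilocks.

I want all the words between "little girl" and "golden curls", as well as the 2 words before and the 2 words after that span.

Is there a simple way to do this? I found the index where each phrase starts, but that led to very long code.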

Answer 1

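import re

# `text` is assumed to hold the paragraph from the question as a single string.
match = re.search(r'(\w+ \w+) little girl (.+) golden curls (\w+ \w+)', text)
if match is not None:  # re.search returns None when the pattern is not found
    whole_match = match.group(0)       # from the two leading words through the two trailing words
    two_words_before = match.group(1)
    phrase_in_middle = match.group(2)
    two_words_after = match.group(3)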
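
The pattern above hard-codes both phrases. As a rough sketch of how it might be generalized, the snippet below wraps the same idea in a helper. The name `words_between`, its parameters, and the shortened sample text are assumptions made for illustration, not part of the original answer. `re.escape` is used so that phrases containing regex metacharacters are matched literally.

```python
import re

def words_between(text, start, end, n=2):
    """Return (n words before, words in between, n words after), or None.

    Hypothetical helper: words are assumed to be separated by single spaces.
    """
    # n words separated by single spaces, e.g. \w+ \w+ for n=2
    n_words = r'\w+(?: \w+){%d}' % (n - 1)
    pattern = r'(%s) %s (.+) %s (%s)' % (
        n_words, re.escape(start), re.escape(end), n_words)
    match = re.search(pattern, text)
    return match.groups() if match else None

# Shortened, punctuation-free sample text (an assumption for this example).
text = ("while they were walking a little girl walked into the house "
        "this little girl had golden curls that fell down to her waist")

result = words_between(text, "little girl", "golden curls")
print(result)
```

Because `\w+ \w+` only matches two words separated by a single space, a comma or period directly before the start phrase (or after the end phrase) would prevent a match; the pattern would need to be loosened for punctuated text.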

Edit

Regex for finding multiple instances of little girl ... golden curls:

# The look-ahead (?=...) consumes no text, so overlapping matches are found; \b makes each match start at a word boundary.
matches = re.findall(r'(?=\b((\w+ \w+) little girl (.+) golden curls (\w+ \w+)))', text)
first_match = matches[0][1:]  # each tuple has the form (full_match, two_words_before, phrase_in_middle, two_words_after); [1:] drops full_match
last_match = matches[-1][1:]  # same form, for the last match
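
One thing to keep in mind is that `(.+)` is greedy, so each match runs up to the last "golden curls" in the text. If the shortest span for each occurrence is wanted instead, a non-greedy `(.+?)` is one option. The following is a minimal sketch under that assumption; the sample text is invented for illustration, and the leading `\b` is included so that a match can only begin at the start of a word.

```python
import re

# Invented sample text containing two separate "little girl ... golden curls" spans.
text = ("a tall little girl with long golden curls ran off "
        "and one little girl in bright golden curls sat down")

# (?=...) is a look-ahead, so matches may overlap; (.+?) stops at the
# nearest "golden curls" rather than the last one.
pattern = r'(?=\b((\w+ \w+) little girl (.+?) golden curls (\w+ \w+)))'

# Each findall tuple is (full_match, before, middle, after); drop full_match.
results = [groups[1:] for groups in re.findall(pattern, text)]

for before, middle, after in results:
    print(before, '|', middle, '|', after)
```

With the greedy `(.+)`, the first result's middle part would instead extend all the way to the second "golden curls".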
