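I have a phrase on one side and many sentences on the other. Each sentence needs to be checked for the phrase, reporting the (index_start, index_end) position of every word of the phrase.
For example,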
phrase: "red moon rises"
sentence: "red moon and purple moon are rises"
result:
1) ["red" (0, 3), "moon" (4, 8), "rises" (29,34)]
2) ["red" (0, 3), "moon" (20, 24), "rises" (29,34)]
Here there are 2 different occurrences of the word "moon".
Another example,
phrase: "Sonic collect rings"
sentence: "Not only Sonic likes to collect rings, Tails likes to collect rings too"
result:
1) ["Sonic" (9, 14), "collect" (24, 31), "rings" (32,37)]
2) ["Sonic" (9, 14), "collect" (24, 31), "rings" (62,67)]
3) ["Sonic" (9, 14), "collect" (54, 61), "rings" (62,67)]
One last example,
phrase: "be smart"
sentence: "Donald always wanted to be clever and to be smart"
result:
1) ["be" (24, 26), "smart" (44, 49)]
2) ["be" (41, 43), "smart" (44, 49)]
I tried building a regex around it, for example 'sonic.*collects.*rings' or the non-greedy variant 'sonic.*?collects.*?rings'. But such solutions give only results 1) and 3).
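To illustrate why only two of the three results show up, here is a minimal sketch using the standard `re` module. It assumes the patterns are spelled to match the sentence exactly (`Sonic`, `collect`, `rings`) and wraps each word in a capture group; `group_spans` is a hypothetical helper name introduced only for this illustration.

```python
import re

sentence = 'Not only Sonic likes to collect rings, Tails likes to collect rings too'

def group_spans(pattern, text):
    """Return the (start, end) span of each capture group of the first match, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return [match.span(i) for i in range(1, len(match.groups()) + 1)]

# Greedy: each '.*' grabs as much as possible, so the *last* usable
# "collect" and "rings" are chosen (result 3).
greedy = group_spans(r'(Sonic).*(collect).*(rings)', sentence)
print(greedy)  # [(9, 14), (54, 61), (62, 67)]

# Non-greedy: each '.*?' grabs as little as possible, so the *first*
# usable "collect" and "rings" are chosen (result 1).
lazy = group_spans(r'(Sonic).*?(collect).*?(rings)', sentence)
print(lazy)  # [(9, 14), (24, 31), (32, 37)]
```

A single regex match commits to one choice per word, and `re.finditer` resumes scanning after the end of the previous match, so the mixed combination (result 2) is never reported by either variant.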
I also tried a positive lookbehind, '(?<=(Sonic.*collect.*rings))', with the third-party regex module, but it captured only 2 of the 3 results.
Some sample code:
import re

# sonic example, extracting all results
text = ['Sonic', 'collect', 'rings']
# wrap each word in a capture group with word boundaries and join with '.*'
builded_regex = '.*'.join([r'\b({})\b'.format(word) for word in text])
for result in re.finditer(builded_regex, 'Not only Sonic likes to collect rings, Tails likes to collect rings too'):
    for i, word in enumerate(text):
        # span(i + 1) is the (start, end) of the i-th capture group
        print('"{}" {}'.format(word, result.span(i + 1)), end=' ')
    print('')
Output:
"Sonic" (9, 14) "collect" (54, 61) "rings" (62, 67)
What is the best solution for this kind of task? I would also like to know whether it can be solved with a regex.
Answer 0: (score: 0)
import re
from itertools import product
from operator import itemgetter

phrase = "red moon rises".split()  # split the phrase into words
search_space = "red moon and purple moon are rises"

all_word_locs = []
for word in phrase:
    word_locs = []
    # find *all* whole-word occurrences of the word in the whole string
    for match in re.finditer(r'\b{}\b'.format(re.escape(word)), search_space):
        s, e = match.span()
        word_locs.append((word, s, e))  # save the word with its (start, end) span
    all_word_locs.append(word_locs)  # gather all the found locations of each word

cart_prod = product(*all_word_locs)  # use the Cartesian product to find all combinations
for found in cart_prod:
    locs = list(map(itemgetter(1), found))  # get the start index of each found word
    if all(x < y for x, y in zip(locs, locs[1:])):
        print(found)  # only print if the words are found in order
*I am using this to check that the word positions are in the correct order.
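The Cartesian product enumerates every combination before filtering, which can grow quickly when words repeat often. As an alternative, here is a minimal sketch that builds only the in-order combinations by backtracking. The function name `ordered_matches` and the requirement that each word starts at or after the end of the previous one are assumptions made for this illustration, not part of the answer above.

```python
import re

def ordered_matches(phrase, sentence):
    """Yield each in-order combination as a list of (word, start, end) tuples."""
    words = phrase.split()
    # All whole-word (start, end) spans for each phrase word, left to right.
    spans = [
        [m.span() for m in re.finditer(r'\b{}\b'.format(re.escape(w)), sentence)]
        for w in words
    ]

    def extend(i, min_start, chosen):
        if i == len(words):
            yield list(chosen)  # a complete combination
            return
        for start, end in spans[i]:
            if start >= min_start:  # must begin after the previous word ends
                chosen.append((words[i], start, end))
                yield from extend(i + 1, end, chosen)
                chosen.pop()

    yield from extend(0, 0, [])

sonic = 'Not only Sonic likes to collect rings, Tails likes to collect rings too'
for combo in ordered_matches('Sonic collect rings', sonic):
    print(combo)
# [('Sonic', 9, 14), ('collect', 24, 31), ('rings', 32, 37)]
# [('Sonic', 9, 14), ('collect', 24, 31), ('rings', 62, 67)]
# [('Sonic', 9, 14), ('collect', 54, 61), ('rings', 62, 67)]
```

The regex is used only to locate individual words; the combinations themselves come from the recursion, which prunes any branch as soon as a word would fall before its predecessor.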
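Answer 1: (score: 0)
Try something like this (I did not write it in Python):
regex reg = "/(Sonic).*(collect).*(rings)/i"
if(reg.match(myString).success)
    myString.find("Sonic")....
First, check whether the phrase occurs in the sentence with its words in the right order.
Then, capture all occurrences of each word.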
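One possible Python reading of these two steps is sketched below. The function name `find_word_spans`, the whole-word matching, and the case-insensitive flag (mirroring the `/i` in the pseudocode) are assumptions made for this illustration.

```python
import re

def find_word_spans(phrase, sentence):
    """Return {word: [(start, end), ...]} if the phrase occurs in order, else {}."""
    words = phrase.split()
    flags = re.IGNORECASE
    # Step 1: check that the words occur somewhere in the right order.
    in_order = '.*'.join(r'\b{}\b'.format(re.escape(w)) for w in words)
    if re.search(in_order, sentence, flags | re.DOTALL) is None:
        return {}
    # Step 2: collect every occurrence of each word.
    return {
        w: [m.span() for m in re.finditer(r'\b{}\b'.format(re.escape(w)), sentence, flags)]
        for w in words
    }

sentence = 'Not only Sonic likes to collect rings, Tails likes to collect rings too'
print(find_word_spans('Sonic collect rings', sentence))
# {'Sonic': [(9, 14)], 'collect': [(24, 31), (54, 61)], 'rings': [(32, 37), (62, 67)]}
```

This only lists where each word occurs; it does not pair the occurrences into ordered combinations, so a further step is still needed to produce the numbered results from the question. A phrase that repeats a word would also collapse into a single dictionary key.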