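I'm writing a function that takes a body of text, splits it into sentences, and then searches each sentence for two words within a certain distance of each other. The function also distinguishes between "present" and "future" constructions, which are captured with a negative and a positive lookahead, respectively.
The regular expression works fine in Regex101, but inside this Python function it returns no matches.
I've stepped through the function in a debugger, and the inputs look fine. I've also made all the necessary changes to go from PHP to Python regex syntax, so I don't think that is the problem either.
Here is the whole function: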
def scope_search(text, word_list1, word_list2, tense, prox=25):
    import regex
    # split the document into sentences, but ignore decimal points in the middle of the sentence
    sentences = [each.lower().lstrip() for each in regex.split('(?!.*\d)\.', text) if each]
    # if tense is 'present', the regex pattern will include negative lookahead for future language in the sentence
    if tense == 'present':
        pattern = '^(?!(hope|expect|will|going to|in the future|plan(ning)? on|anticipate?(ing)?|foresee(ing)?|forecasts?))(\\b({0})\\b.{{0,{1}}}\\b({2})\\b)|(\\b({2})\\b.{{0,{1}}}\\b({0})\\b)$'.format(word_list1,prox,word_list2)
    # if the tense is 'future', the regex pattern will include positive lookahead for future language
    elif tense == 'future':
        pattern = '^(?=(hope|expect|will|going to|in the future|plan(ning)? on|anticipate?(ing)?|foresee(ing)?|forecasts?))(\\b({0})\\b.{{0,{1}}}\\b({2})\\b)|(\\b({2})\\b.{{0,{1}}}\\b({0})\\b)$'.format(word_list1,prox,word_list2)
    matches = []
    # search sentence by sentence for all relevant matches
    for sentence in sentences:
        matches.append(regex.findall(pattern, sentence))
    matches = [each for each in list(matches[0]) if each]
    return matches
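One piece that can be checked in isolation is the sentence splitter. The sketch below is only a sanity check under assumptions: it uses the standard-library `re` module instead of `regex`, the sample text is made up, and the second pattern is a hypothetical alternative, not something taken from the function. It shows that the lookahead `(?!.*\d)` rejects a period whenever any digit appears later on the same line, not only when the period is a decimal point.

```python
import re

# Hypothetical sample text; the decimal point in "3.5" should not end a sentence.
sample = "Sales grew. Earnings rose 3.5 percent."

# Splitter from the function: the lookahead scans the rest of the line for ANY digit,
# so the period after "grew" is rejected because "3" and "5" appear later.
original_split = re.split(r'(?!.*\d)\.', sample)

# Assumed alternative: refuse to split only when a digit directly follows the period.
alternative_split = re.split(r'\.(?!\d)', sample)

print(original_split)     # ['Sales grew. Earnings rose 3.5 percent', '']
print(alternative_split)  # ['Sales grew', ' Earnings rose 3.5 percent', '']
```

With the original pattern the two sentences stay glued together, so the per-sentence search runs on a different string than one might expect.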
I thought it might be a string-formatting problem, but the formatting works fine outside this function.
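The formatting check can be reproduced roughly as follows. This is a minimal sketch with shortened, hypothetical word lists; it uses only `str.format` and the standard-library `re` module. Printing the result gives the exact pattern string that the regex engine receives, which can then be pasted into Regex101 for a like-for-like comparison.

```python
import re

# Hypothetical, shortened word lists used only for this check.
word_list1 = r'(grow|growth)'
word_list2 = r'(sales|earnings)'
prox = 25

# Same brace-escaping scheme as in the function: {{ and }} become literal braces,
# while {0}, {1} and {2} are replaced by the arguments to format().
template = '\\b({0})\\b.{{0,{1}}}\\b({2})\\b'
pattern = template.format(word_list1, prox, word_list2)

print(pattern)  # \b((grow|growth))\b.{0,25}\b((sales|earnings))\b
print(re.search(pattern, 'growth in sales') is not None)  # True
```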
Here are the two word lists I'm searching for:
word_list1 = (increase?|double?|high(er)?|strong|strength|grow|growth|grew|(go|goes|went|going) up)(s|ing)?
word_list2 = (AUM|FUM|assets under management|funds under management|shipments|basis points|earnings|sales|revenues?|deposits|orders?|(new )?participants?( counts?)?)
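To narrow things down, a reduced version with shortened lists can be run outside the function. The sketch below is an experiment under stated assumptions, not a confirmed diagnosis: `W1`, `W2`, `FUTURE`, `has_match`, and the sample sentences are all hypothetical, and it uses the standard-library `re` module. It compares the anchoring used in the function, where `^` and `$` each attach to only one side of the top-level `|`, with a variant that groups both word orders and lets the lookahead scan the whole sentence via `.*`. Matching is case-insensitive here because the function lowercases each sentence while the lists contain uppercase entries such as `AUM`.

```python
import re

# Hypothetical, shortened word lists (not the full lists from the question).
W1 = r'(?:grow|growth|grew|increase)'
W2 = r'(?:sales|earnings|revenues?)'
FUTURE = r'(?:hope|expect|will|going to)'
PROX = 25

# Anchoring as in the function: `^` applies only to the first alternative
# and `$` only to the second, because `|` has the lowest precedence.
anchored = r'^\b{0}\b.{{0,{1}}}\b{2}\b|\b{2}\b.{{0,{1}}}\b{0}\b$'.format(W1, PROX, W2)

# Assumed variant: both word orders wrapped in a single group.
pair = r'(?:\b{0}\b.{{0,{1}}}\b{2}\b|\b{2}\b.{{0,{1}}}\b{0}\b)'.format(W1, PROX, W2)

# Assumed variant: the lookahead scans the whole sentence via `.*`
# instead of testing only the first word.
present = r'^(?!.*\b{0}\b).*?{1}'.format(FUTURE, pair)
future = r'^(?=.*\b{0}\b).*?{1}'.format(FUTURE, pair)


def has_match(pattern, sentence):
    """Return True if the pattern matches anywhere in the sentence."""
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


now = 'we saw sales grow last year'
later = 'we expect sales will grow next year'

print(has_match(anchored, now))                            # False
print(has_match(present, now), has_match(future, now))     # True False
print(has_match(present, later), has_match(future, later)) # False True
```

In this reduced form, the anchored pattern only matches when a sentence starts with a word from the first list or ends with one, which would be consistent with getting no matches on ordinary sentences.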
Again, all of this works fine in Regex101, so I don't think the regex itself is the problem. Any help would be much appreciated.