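I want to search a string for a specific sequence of words. So far I can find the words when they appear anywhere in the string, but I cannot find them in a specific order. Let me illustrate: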
from nltk.tokenize import word_tokenize

negative_descriptors = ['no', 'unlikely', 'no evidence of']
diagnosis = 'disc prolapse'
report = 'There is no evidence of disc prolapse but this is evidence of a collection.'

def find_diagnosis(diagnosis, negative_descriptors, report):
    keywords = word_tokenize(diagnosis)
    if [keyword for keyword in keywords if keyword in report] == keywords:
        if [descriptor for descriptor in negative_descriptors if descriptor in report]:
            return False
        else:
            return True
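In the example above, the algorithm should return False if both a negative descriptor AND the diagnosis appear in the report, and the negative descriptor appears before the diagnosis (with no more than one word between them).
How can I make the algorithm take into account not only the words, but also their order?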
Answer 0 (score: 0)
import re

negative_descriptors = ['no', 'unlikely', 'no evidence of']
diagnosis = 'disc prolapse'
report = 'There is no evidence of disc prolapse but this is evidence of a collection.'

if diagnosis in report:
    for ng in negative_descriptors:
        # Descriptor, then exactly one whitespace or word character, then the diagnosis.
        pattern = re.escape(ng) + r"[\s\w]" + re.escape(diagnosis)
        print(re.findall(pattern, report))
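Note that the character class above matches exactly one character, so it only fires when the descriptor is directly followed by the diagnosis. A minimal sketch of a variant that allows up to one whole word in between, and uses word boundaries so that 'no' does not match inside a word like 'known', could look like this. The function name `negated_by` and the `max_gap_words` parameter are my own assumptions, not part of the question:

```python
import re

def negated_by(report, diagnosis, negative_descriptors, max_gap_words=1):
    """Return the descriptors that precede `diagnosis` with at most
    `max_gap_words` whole words in between (hypothetical helper)."""
    hits = []
    for descriptor in negative_descriptors:
        pattern = (
            r"\b" + re.escape(descriptor) + r"\b"            # descriptor as whole words
            + r"(?:\s+\w+){0," + str(max_gap_words) + r"}"   # optional intervening words
            + r"\s+" + re.escape(diagnosis) + r"\b"          # then the diagnosis
        )
        if re.search(pattern, report, flags=re.IGNORECASE):
            hits.append(descriptor)
    return hits

report = 'There is no evidence of disc prolapse but this is evidence of a collection.'
print(negated_by(report, 'disc prolapse', ['no', 'unlikely', 'no evidence of']))
```

Here only 'no evidence of' is reported: the bare 'no' is three words away from the diagnosis, which exceeds the one-word limit.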
Answer 1 (score: 0)
If there are only a few negative descriptors, you can join them with | :
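import re

negative_descriptors = ['no', 'unlikely', 'no evidence of']
diagnosis = 'disc prolapse'
report = 'There is no evidence of disc prolapse but this is no evidence of a collection.'

# Build one alternation from all descriptors: no|unlikely|no evidence of
neg = '|'.join(negative_descriptors)
# Any descriptor, followed (lazily, at any distance) by the diagnosis.
out = re.search("(" + neg + ")" + r".*?" + diagnosis, report)
print(out is not None)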
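One caveat: `.*?` puts no limit on the distance, so a report such as 'There is no fracture, but a large disc prolapse is present.' would also count as negated. If the one-word limit from the question matters, a token-based check is one alternative. The sketch below is only an illustration under my own assumptions: `tokenize` is a simple stand-in for `word_tokenize`, `max_gap_words` is a made-up parameter, and returning None when the diagnosis is absent is my choice. It also ignores sentence boundaries.

```python
import re

def tokenize(text):
    # Lowercased word tokens; a simple stand-in for nltk's word_tokenize (assumption).
    return re.findall(r"\w+", text.lower())

def find_diagnosis(diagnosis, negative_descriptors, report, max_gap_words=1):
    """Return False if a mention of the diagnosis is negated, True if it is
    mentioned without negation, and None if it is not mentioned at all."""
    words = tokenize(report)
    target = tokenize(diagnosis)
    descriptors = [tokenize(d) for d in negative_descriptors]
    found = False
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] != target:
            continue
        found = True
        # Allow 0..max_gap_words words between the descriptor and the diagnosis.
        for gap in range(max_gap_words + 1):
            end = i - gap
            for d in descriptors:
                if end >= len(d) and words[end - len(d):end] == d:
                    return False
    return True if found else None

negative_descriptors = ['no', 'unlikely', 'no evidence of']
print(find_diagnosis('disc prolapse', negative_descriptors,
                     'There is no evidence of disc prolapse but this is no evidence of a collection.'))
print(find_diagnosis('disc prolapse', negative_descriptors,
                     'There is no fracture, but a large disc prolapse is present.'))
```

The first call finds 'no evidence of' immediately before the diagnosis; in the second, the nearest 'no' is more than one word away, so the diagnosis is treated as present.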