Question

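I want to search for a specific sequence of words in a string. So far I have been able to find the words (scattered anywhere in the string), but I can't find them in a particular order. Let me illustrate with an example: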

from nltk.tokenize import word_tokenize
negative_descriptors = ['no', 'unlikely', 'no evidence of']
diagnosis = 'disc prolapse'
report = 'There is no evidence of disc prolapse but this is evidence of a collection.'

def find_diagnosis(diagnosis, negative_descriptors, report):
    keywords = word_tokenize(diagnosis)
    if [keyword for keyword in keywords if keyword in report] == keywords:
        if [descriptor for descriptor in negative_descriptors if descriptor in report]: return False
        else: return True

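In the example above, the algorithm should return False if a negative descriptor AND the diagnosis both appear in the report, and the negative descriptor comes before the diagnosis (separated by no more than one word).

How can I make sure the algorithm takes into account not just the words but also their order?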

Answer 1

import re
negative_descriptors = ['no', 'unlikely', 'no evidence of']
diagnosis = 'disc prolapse'
report = 'There is no evidence of disc prolapse but this is evidence of a collection.'

if diagnosis in report:
    for ng in negative_descriptors:
        # descriptor, whitespace, at most one intervening word, then the diagnosis
        pattern = r"\b" + re.escape(ng) + r"\s+(?:\w+\s+)?" + re.escape(diagnosis)
        print(re.findall(pattern, report))
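
To get the True/False result the question asks for, the same idea can be wrapped in a function. This is only a sketch under a few assumptions of mine: the name `find_diagnosis` is borrowed from the question, `max_gap_words` is a hypothetical parameter for the "no more than one word apart" rule, and matching is case-insensitive.

```python
import re

def find_diagnosis(diagnosis, negative_descriptors, report, max_gap_words=1):
    """Return True if the diagnosis is mentioned and not preceded by a negation."""
    diag = re.escape(diagnosis)
    if not re.search(r"\b" + diag + r"\b", report, re.IGNORECASE):
        return False  # the diagnosis is not mentioned at all

    # Longest descriptors first, so 'no evidence of' is tried before 'no'.
    ordered = sorted(negative_descriptors, key=len, reverse=True)
    alternatives = "|".join(re.escape(d) for d in ordered)

    # descriptor, then up to max_gap_words intervening words, then the diagnosis
    pattern = (
        r"\b(?:" + alternatives + r")\s+"
        + r"(?:\w+\s+){0," + str(max_gap_words) + r"}"
        + diag + r"\b"
    )
    return re.search(pattern, report, re.IGNORECASE) is None


negative_descriptors = ['no', 'unlikely', 'no evidence of']
report = 'There is no evidence of disc prolapse but this is evidence of a collection.'
print(find_diagnosis('disc prolapse', negative_descriptors, report))
```

The `\b` anchors keep a short descriptor such as `no` from matching inside a longer word, and because the gap is bounded, a negation that sits far away from the diagnosis is ignored.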

Answer 2

If the set of negative descriptors is small, you can join them with | into a single alternation:

    import re
    negative_descriptors = ['no', 'unlikely', 'no evidence of']
    diagnosis = 'disc prolapse'
    report = 'There is no evidence of disc prolapse but this is no evidence of a collection.'
    neg = '|'.join(negative_descriptors)

    # any negative descriptor, followed (lazily) by anything, then the diagnosis
    out = re.search("(" + neg + ")" + r".*?" + re.escape(diagnosis), report)
    print(out is not None)
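
Note that `.*?` in the pattern above accepts any amount of text between the descriptor and the diagnosis, so it checks the order but not the "no more than one word apart" condition from the question. If that distance matters, one alternative is to compare word lists directly instead of using a regex. The sketch below is my own illustration, not part of the original answer: `is_negated` and `max_gap` are hypothetical names, and the tokenisation is deliberately crude (whitespace split plus stripping common punctuation).

```python
def is_negated(diagnosis, negative_descriptors, report, max_gap=1):
    """Return True if a descriptor ends at most max_gap words before the diagnosis."""

    def tokens(text):
        # Lowercase, strip surrounding punctuation, drop empty leftovers.
        stripped = (w.strip(".,;:!?()").lower() for w in text.split())
        return [w for w in stripped if w]

    words = tokens(report)
    target = tokens(diagnosis)
    n = len(target)

    for start in range(len(words) - n + 1):
        if words[start:start + n] != target:
            continue  # the diagnosis does not start at this position
        for descriptor in negative_descriptors:
            neg = tokens(descriptor)
            m = len(neg)
            # Allow 0..max_gap other words between descriptor and diagnosis.
            for gap in range(max_gap + 1):
                begin = start - gap - m
                if begin >= 0 and words[begin:begin + m] == neg:
                    return True
    return False


negative_descriptors = ['no', 'unlikely', 'no evidence of']
report = 'There is no evidence of disc prolapse but this is evidence of a collection.'
print(is_negated('disc prolapse', negative_descriptors, report))
```

Working on tokens makes the "at most one word in between" rule explicit as an index calculation, at the cost of a simpler notion of what counts as a word than `word_tokenize` would give.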

How do I search for a specific sequence of words in a string?
