How do I search a string for a specific sequence of words?

Asked: 2018-08-18 16:05:31

Tags: python regex nltk

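I want to search a string for a specific sequence of words. So far I can find the words when they are scattered anywhere in the string, but I cannot find them in a specific order. Let me illustrate: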

from nltk.tokenize import word_tokenize
negative_descriptors = ['no', 'unlikely', 'no evidence of']
diagnosis = 'disc prolapse'
report = 'There is no evidence of disc prolapse but this is evidence of a collection.'

def find_diagnosis(diagnosis, negative_descriptors, report):
    keywords = word_tokenize(diagnosis)
    if [keyword for keyword in keywords if keyword in report] == keywords:
        if [descriptor for descriptor in negative_descriptors if descriptor in report]: return False
        else: return True

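In the example above, the algorithm should return False if a negative descriptor AND the diagnosis both appear in the report, with the negative descriptor appearing before the diagnosis (and no more than 1 word apart).

How can I make sure the algorithm considers not only the words but also their order?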

2 answers:

Answer 0 (score: 0)

import re
negative_descriptors = ['no', 'unlikely', 'no evidence of']
diagnosis = 'disc prolapse'
report = 'There is no evidence of disc prolapse but this is evidence of a collection.'

if diagnosis in report:
    for ng in negative_descriptors:
        # Match the descriptor, exactly one separating character, then the diagnosis.
        pattern = re.escape(ng) + r"[\s\w]" + re.escape(diagnosis)
        print(re.findall(pattern, report))
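
The question also asks for the descriptor to be allowed up to one word away from the diagnosis. A minimal sketch of one way to express that with a bounded gap in the pattern follows. The function name `negating_descriptors` and the `max_gap` parameter are made up for illustration and are not part of the original answer; the match is case-insensitive, which is also an assumption.

```python
import re

def negating_descriptors(report, diagnosis, negative_descriptors, max_gap=1):
    """Return the descriptors that occur before `diagnosis` with at most
    `max_gap` whole words in between (hypothetical helper)."""
    found = []
    for descriptor in negative_descriptors:
        pattern = (
            r"\b" + re.escape(descriptor)                 # descriptor at a word start
            + r"\s+(?:\w+\s+){0," + str(max_gap) + r"}"   # up to max_gap words in between
            + re.escape(diagnosis) + r"\b"                # then the diagnosis itself
        )
        if re.search(pattern, report, flags=re.IGNORECASE):
            found.append(descriptor)
    return found

report = 'There is no evidence of disc prolapse but this is evidence of a collection.'
print(negating_descriptors(report, 'disc prolapse', ['no', 'unlikely', 'no evidence of']))
```

Because the gap is bounded, a bare 'no' three words before the diagnosis is not reported here; only 'no evidence of', which sits directly in front of it, is.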

Answer 1 (score: 0)

If the number of negative descriptors is small, you can join them with |:

    import re
    negative_descriptors = ['no', 'unlikely', 'no evidence of']
    diagnosis = 'disc prolapse'
    report = 'There is no evidence of disc prolapse but this is no evidence of a collection.'
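    # Escape each descriptor so regex metacharacters are matched literally.
    neg = '|'.join(re.escape(d) for d in negative_descriptors)

    # Lazily match a descriptor, any text, then the diagnosis.
    out = re.search("(" + neg + ")" + r".*?" + re.escape(diagnosis), report)
    print(out is not None)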
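
Note that `.*?` places no limit on how far the descriptor can be from the diagnosis. As an alternative that stays closer to the tokenizing approach in the question, here is a rough sketch that compares word positions instead of using one large pattern. The `tokenize` helper (a simple stand-in for `word_tokenize`), the `max_gap` parameter, and the convention of returning None when the diagnosis is absent are all assumptions made for this example.

```python
import re

def tokenize(text):
    # Lowercased word tokens; a simple stand-in for nltk's word_tokenize.
    return re.findall(r"\w+", text.lower())

def find_diagnosis(diagnosis, negative_descriptors, report, max_gap=1):
    """Return False if any occurrence of the diagnosis is preceded by a
    negative descriptor within max_gap words, True if it occurs without
    one, and None if it does not occur at all."""
    words = tokenize(report)
    target = tokenize(diagnosis)
    if not target:
        return None
    n = len(target)
    # Every index where the diagnosis words appear consecutively, in order.
    starts = [i for i in range(len(words) - n + 1) if words[i:i + n] == target]
    if not starts:
        return None
    descriptors = [t for t in (tokenize(d) for d in negative_descriptors) if t]
    for start in starts:
        for desc in descriptors:
            m = len(desc)
            # The descriptor may end between start - max_gap and start.
            for end in range(start - max_gap, start + 1):
                if end - m >= 0 and words[end - m:end] == desc:
                    return False
    return True

negative_descriptors = ['no', 'unlikely', 'no evidence of']
report = 'There is no evidence of disc prolapse but this is evidence of a collection.'
print(find_diagnosis('disc prolapse', negative_descriptors, report))
```

Working on token positions makes the "at most one word apart" rule an explicit index check, at the cost of ignoring punctuation between the descriptor and the diagnosis.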