Python regex code to detect specific features in a sentence

Date: 2018-08-22 18:57:11

Tags: python regex nltk

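I have created a simple word feature detector. So far it can find specific features of a string (jumbled within it), but the algorithm gets confused by certain sequences of words. Let me illustrate: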

from nltk.tokenize import word_tokenize
negative_descriptors = ['no', 'unlikely', 'no evidence of']
negative_descriptors = '|'.join(negative_descriptors)
negative_trailers = ['not present', 'not evident']
negative_trailers = '|'.join(negative_descriptors)

keywords = ['disc prolapse', 'vertebral osteomyelitis', 'collection']

def feature_match(message, keywords, negative_descriptors):
    if re.search(r"("+negative_descriptors+")" + r".*?" + r"("+keywords+")", message): return True
    if re.search(r"("+keywords+")" + r".*?" + r"("+negative_trailers+")", message): return True

The above returns True for the following messages:

message = 'There is no evidence of a collection.' 
message = 'A collection is not present.'

This is correct, since it implies that the keyword/condition I am looking for is absent. However, it returns None for the following messages:

message = 'There is no evidence of disc prolapse, collection or vertebral osteomyelitis.'
message = 'There is no evidence of disc prolapse/vertebral osteomyelitis/ collection.'

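It seems to be matching 'or vertebral osteomyelitis' in the first message and '/ collection' in the second, but this is wrong: it implies that the message reads "the condition I am looking for IS present". It should actually return True.

How can I prevent this?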

1 Answer:

Answer 0: (score: 0)

The code you posted has several problems:

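  1. negative_trailers = '|'.join(negative_descriptors) should be negative_trailers = '|'.join(negative_trailers)
  2. You should also convert the keywords list to a string, as you did with the other lists, so that it can be passed to the regex
  3. Using the "r" prefix three times in the regex is unnecessary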
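
To illustrate point 2, here is a minimal sketch of why the list has to be joined first. Concatenating a list onto a string raises a TypeError, so the pattern is never built. The use of re.escape and the names caught and keyword_pattern are my own additions for illustration and are not in the original code.

```python
import re

keywords = ['disc prolapse', 'vertebral osteomyelitis', 'collection']

# Concatenating a list onto a str raises TypeError, so no pattern is produced.
caught = False
try:
    pattern = "(" + keywords + ")"
except TypeError:
    caught = True

# Joining with '|' yields a single alternation string that re can use.
# re.escape is an extra precaution (an assumption, not part of the original
# code) in case a keyword ever contains regex metacharacters.
keyword_pattern = '|'.join(re.escape(k) for k in keywords)
match = re.search("(" + keyword_pattern + ")", 'A collection is not present.')
print(caught, match.group(1))
```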

After these corrections, your code should look like this:

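import re

negative_descriptors = ['no', 'unlikely', 'no evidence of']
negative_descriptors = '|'.join(negative_descriptors)
negative_trailers = ['not present', 'not evident']
negative_trailers = '|'.join(negative_trailers)

keywords = ['disc prolapse', 'vertebral osteomyelitis', 'collection']
keywords = '|'.join(keywords)  # joined into one alternation, like the other lists

# example message taken from the question
message = 'There is no evidence of disc prolapse, collection or vertebral osteomyelitis.'

neg_desc_present = False
# a negative descriptor followed, anywhere later, by a keyword
if re.search(r"("+negative_descriptors+").*("+keywords+")", message): neg_desc_present = True
# a keyword followed, anywhere later, by a negative trailer
if re.search(r"("+keywords+").*("+negative_trailers+")", message): neg_desc_present = True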
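
If you want this as a reusable function, the following is one possible sketch rather than a definitive implementation. The function name is_negated, the helper _alternation, the uppercase constants, the \b word boundaries and the re.escape calls are all my own assumptions and do not appear in the question. The word boundaries are there because a bare "no" would otherwise also match inside words such as "abnormal".

```python
import re

NEGATIVE_DESCRIPTORS = ['no', 'unlikely', 'no evidence of']
NEGATIVE_TRAILERS = ['not present', 'not evident']
KEYWORDS = ['disc prolapse', 'vertebral osteomyelitis', 'collection']


def _alternation(terms):
    """Join terms into a regex alternation, escaping any metacharacters."""
    return '|'.join(re.escape(term) for term in terms)


# Negation phrase first, keyword somewhere after it.
LEADING = re.compile(
    rf"\b(?:{_alternation(NEGATIVE_DESCRIPTORS)})\b.*\b(?:{_alternation(KEYWORDS)})\b"
)
# Keyword first, negation phrase somewhere after it.
TRAILING = re.compile(
    rf"\b(?:{_alternation(KEYWORDS)})\b.*\b(?:{_alternation(NEGATIVE_TRAILERS)})\b"
)


def is_negated(message):
    """Return True if a keyword appears together with a negation phrase."""
    return bool(LEADING.search(message) or TRAILING.search(message))


print(is_negated('There is no evidence of disc prolapse, collection or vertebral osteomyelitis.'))
print(is_negated('An abnormal collection is seen.'))
```

Note that this still treats a negation anywhere before a keyword as applying to it, so a sentence that negates one finding and asserts another would need a stricter rule.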