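I have a large sample text, for example:
"The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency / effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."
I am trying to detect, in a fuzzy way, whether "engage the prognosis for survival" occurs in the text. For example, "has engage the progronosis of survival" must also return a positive answer.
I looked into fuzzywuzzy, nltk and the new regex fuzzy functions, but I could not find a way to do: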
if [anything similar (>90%) to "that sentence"] in mybigtext:
    print True
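Answer 0 (score: 1)
The following is not ideal, but it should get you started. It uses nltk to first split the text into words and then build a set containing the stems of all the words, filtering out any stop words. It does this for both your sample text and the sample query.
If the intersection of the two sets contains all of the words in the query, it is considered a match.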
import nltk  # the tokenizer and stop-word data must be installed first (see nltk.download())
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
ps = PorterStemmer()

def get_word_set(text):
    # Tokenize, drop stop words, and reduce each remaining word to its stem
    return set(ps.stem(word) for word in word_tokenize(text) if word not in stop_words)

# text1 contains the whole query phrase; text2 has the word "prognosis" removed
text1 = "The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency / effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."
text2 = "The arterial high blood pressure may engage the for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency / effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."

query = "engage the prognosis for survival"
set_query = get_word_set(query)

for text in [text1, text2]:
    set_text = get_word_set(text)
    intersection = set_query & set_text

    print("Query:", set_query)
    print("Test:", set_text)
    print("Intersection:", intersection)
    # Match only if every stemmed query word also occurs in the text
    print("Match:", len(intersection) == len(set_query))
    print()
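The script is given two texts, one that passes and one that does not, and produces the following output to show what it is doing (shown as printed by Python 2; Python 3 prints sets as {...} without the u prefix, and the exact tokens may differ with newer NLTK versions):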
Query: set([u'prognosi', u'engag', u'surviv'])
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'framework', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'prognosi', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first'])
Intersection: set([u'prognosi', u'engag', u'surviv'])
Match: True
Query: set([u'prognosi', u'engag', u'surviv'])
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'framework', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first'])
Intersection: set([u'engag', u'surviv'])
Match: False
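The match test in this answer amounts to asking whether the query's word set is a subset of the text's word set. Below is a minimal, dependency-free sketch of just that test. The regex tokenizer, the lowercasing, and the tiny STOP_WORDS set are assumptions standing in for NLTK's tokenizer and stop-word list, and no stemming is done here:

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLTK's list is much larger.
STOP_WORDS = {"the", "for", "of", "a", "as", "may"}

def word_set(text):
    """Lowercase the text, split it into alphabetic words, and drop stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}

def contains_all_words(query, text):
    """True if every non-stop word of the query also occurs somewhere in the text."""
    return word_set(query) <= word_set(text)

with_word = "The arterial high blood pressure may engage the prognosis for survival of the patient"
without_word = "The arterial high blood pressure may engage the for survival of the patient"
query = "engage the prognosis for survival"

print(contains_all_words(query, with_word))
print(contains_all_words(query, without_word))
```

Because this compares sets, word order and the distance between words are ignored: the query words can be scattered anywhere in the text and still count as a match.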
Answer 1 (score: 1)
Using the regex module, first split the text into sentences, then test whether the fuzzy pattern occurs in each sentence:
import regex

tgt="The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency / effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."

for sentence in regex.split(r'(?<=[.?!;])\s+(?=\p{Lu})', tgt):
    # Error budget: fewer than len(pat)/5 errors (e<10 here).
    # Note that len(pat) counts the whole pattern string, not just the phrase.
    pat=r'(?e)((?:has engage the progronosis of survival){e<%i})'
    pat=pat % int(len(pat)/5)
    m=regex.search(pat, sentence)
    if m:
        print("'{}'\n\tfuzzy matches\n'{}'\n\twith \n{} substitutions, {} insertions, {} deletions".format(pat,m.group(1), *m.fuzzy_counts))
Prints:
'(?e)((?:has engage the progronosis of survival){e<10})'
fuzzy matches
'may engage the prognosis for survival'
with
3 substitutions, 1 insertions, 2 deletions
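The sentence-splitting step in this answer can be tried on its own. Here is a rough sketch using only the standard re module, under the assumption that sentences start with an ASCII capital, so [A-Z] stands in for \p{Lu} (which the standard re module does not support). The name split_sentences is made up for this example:

```python
import re

def split_sentences(text):
    """Split after . ? ! or ; when followed by whitespace and an ASCII capital."""
    return re.split(r'(?<=[.?!;])\s+(?=[A-Z])', text)

tgt = ("The arterial high blood pressure may engage the prognosis for survival "
       "of the patient as a result of complications. "
       "TENSTATEN enters within the framework of a preventive treatment(processing). "
       "His(Her,Its) report(relationship) efficiency / effects unwanted is important. "
       "diuretics, medicine of first intention of which TENSTATEN, is. "
       "The therapeutic alternatives are very numerous.")

for sentence in split_sentences(tgt):
    print(repr(sentence))
```

Since "diuretics" starts with a lowercase letter, no split happens before it and it stays attached to the preceding sentence, so this text yields four pieces rather than five.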
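Answer 2 (score: 0)
Below is a function that reports a match whenever a word from the phrase is contained in the text. You could build on it to check for complete phrases in the text.
Here is my function: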
def FuzzySearch(text, phrase):
    """Report each word of phrase that is contained (as an exact substring) in text"""
    for word in phrase.split(" "):
        if word in text:
            print("Match! Found " + word + " in text")
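One possible way to extend the word check above to whole phrases is sketched here with the standard library's difflib. This is a swapped-in technique, not part of the function above, and the names best_window and fuzzy_contains as well as the default threshold are assumptions for illustration. The idea is to slide a window with as many words as the phrase across the text and keep the highest SequenceMatcher ratio:

```python
from difflib import SequenceMatcher

def best_window(text, phrase):
    """Return (ratio, window) for the word window of text most similar to phrase."""
    words = text.split()
    n = len(phrase.split())
    best_ratio, best_text = 0.0, ""
    # Slide a window of n words across the text, one word at a time.
    for i in range(max(len(words) - n + 1, 1)):
        window = " ".join(words[i:i + n])
        ratio = SequenceMatcher(None, phrase.lower(), window.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_text = ratio, window
    return best_ratio, best_text

def fuzzy_contains(text, phrase, threshold=0.9):
    """True if some window of the text reaches the similarity threshold."""
    ratio, _ = best_window(text, phrase)
    return ratio >= threshold

text = ("The arterial high blood pressure may engage the prognosis for survival "
        "of the patient as a result of complications.")

print(fuzzy_contains(text, "engage the prognosis for survival"))
print(best_window(text, "has engage the progronosis of survival"))
```

The 0.9 default mirrors the ">90%" in the question. For a phrase as garbled as "has engage the progronosis of survival", a lower threshold such as 0.8 may be needed.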