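I found myself in a situation where I need to implement a sequential pattern matching algorithm in Python. After hours of searching, I could not find any working library or snippet on the Internet.
Problem definition:
Implement a function sequential_pattern_match
Input: tokens (an ordered collection of strings)
Output: a list of tuples, each tuple = (any subset of the tokens, tag)
Domain experts will define the matching rules, usually using regular expressions:
test(tokens) -> tag or None
Example:
Input: ["Singapore", "Python", "User", "Group", "is", "here"]
Output: [(["Singapore", "Python", "User", "Group"], "ORGANIZATION"), ("is", 'O'), ("here", 'O')]
'O' means no match.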
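Since the rules are expected to be regular expressions, test could be driven by a small rule table. This is only a minimal sketch, assuming each rule is applied to the space-joined tokens; the name regex_test, the RULES table, and the patterns themselves are made up for illustration and are not part of the problem statement.

```python
import re

# Hypothetical rule table: (compiled pattern, tag). Patterns are illustrative only.
RULES = [
    (re.compile(r"^(Barium Swallow|Swallow Test)$"), "INTERVENTION"),
    (re.compile(r"^pain( in \w+)?$"), "SYMPTOM"),
    (re.compile(r"^Nexium$"), "MEDICINE"),
]


def regex_test(tokens):
    """Return the tag of the first rule matching the joined tokens, else None."""
    text = " ".join(tokens)
    for pattern, tag in RULES:
        if pattern.match(text):
            return tag
    return None
```

The first matching rule wins, so the order of RULES acts as a simple priority between rules that match the same text.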
Conflict resolution rules:
With my expertise in algorithms and data structures, here is my implementation:
# Python 2 code: itertools.ifilter/imap, xrange and iterator.next() do not exist in Python 3.
from itertools import ifilter, imap

MAX_PATTERN_LENGTH = 3

def test(tokens):
    # Stand-in for the expert-defined rules: returns a tag, or None when nothing matches.
    length = len(tokens)
    if (length == 1):
        if tokens[0] == "Nexium":
            return "MEDICINE"
        elif tokens[0] == "pain":
            return "SYMPTOM"
    elif (length == 2):
        string = ' '.join(tokens)
        if string == "Barium Swallow":
            return "INTERVENTION"
        elif string == "Swallow Test":
            return "INTERVENTION"
    else:
        if ' '.join(tokens) == "pain in stomach":
            return "SYMPTOM"

def _evaluate(tokens):
    tag = test(tokens)
    if tag:
        return (tokens, tag)
    elif len(tokens) == 1:
        # A single unmatched token is always accepted with the 'O' tag.
        return (tokens, 'O')

def _splits(tokens):
    # Yield (head, tail) pairs, longest head first.
    return ((tokens[:i], tokens[i:]) for i in xrange(min(len(tokens), MAX_PATTERN_LENGTH), 0, -1))

def sequential_pattern_match(tokens):
    # Take the first split whose head matches.
    return ifilter(bool, imap(_halves_match, _splits(tokens))).next()

def _halves_match(halves):
    result = _evaluate(halves[0])
    if result:
        return [result] + (halves[1] and sequential_pattern_match(halves[1]))

if __name__ == "__main__":
    tokens = "I went to a clinic to do a Barium Swallow Test because I had pain in stomach after taking Nexium".split()
    output = sequential_pattern_match(tokens)
    slashTags = ' '.join(t + '/' + tag for tokens, tag in output for t in tokens)
    print(slashTags)
    assert slashTags == "I/O went/O to/O a/O clinic/O to/O do/O a/O Barium/INTERVENTION Swallow/INTERVENTION Test/O because/O I/O had/O pain/SYMPTOM in/SYMPTOM stomach/SYMPTOM after/O taking/O Nexium/MEDICINE"

    import timeit
    t = timeit.Timer(
        'sequential_pattern_match("I went to a clinic to do a Barium Swallow Test because I had pain in stomach after taking Nexium".split())',
        'from __main__ import sequential_pattern_match'
    )
    print(t.repeat(3, 10000))
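I don't think it can get any faster. Unfortunately, it is written in a functional style, which may not suit Python well. Can you come up with a faster implementation in an OO or imperative style?
(Note: I am sure it would be faster if implemented in C, but at the moment I have no plans to use any language other than Python.)
Answer 0 (score: 1)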
# MAX_PATTERN_LENGTH is defined as in the question.

def sequential_pattern_match(tokens):
    for first, rest in _splits(tokens):
        x = _halves_match(first, rest)
        if x:
            return x

def _splits(tokens):
    # Yield (head, tail) pairs, longest head first.
    for i in xrange(min(len(tokens), MAX_PATTERN_LENGTH), 0, -1):
        yield tokens[:i], tokens[i:]

def _halves_match(first, rest):
    tag = test(first)
    if tag:
        return [(first, tag)] + (rest and sequential_pattern_match(rest))

def test(tokens):
    length = len(tokens)
    if length == 1:
        if tokens[0] == "Nexium":
            return "MEDICINE"
        elif tokens[0] == "pain":
            return "SYMPTOM"
        else:
            # A single unmatched token gets the 'O' tag here, so _evaluate is no longer needed.
            return "O"
    elif length == 2:
        if tokens == ["Barium", "Swallow"]:
            return "INTERVENTION"
        elif tokens == ["Swallow", "Test"]:
            return "INTERVENTION"
    elif tokens == ["pain", "in", "stomach"]:
        return "SYMPTOM"
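I replaced the uses of ifilter and imap with a simple for loop, and replaced the generator expression with a for loop that uses yield.
This shortened the timings on my machine.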
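A fully imperative variant is also possible. The following is only a sketch, not a measured improvement: it assumes the rules can be expressed as an exact lookup from a tuple of tokens to a tag, which holds for the toy test function in this thread but not for general regular expressions. The names PATTERNS and match_iterative are hypothetical. It scans left to right and tries the longest window first; because a single token always matches (as 'O'), the recursive versions never actually need to backtrack, so a plain loop should give the same result for these rules. It is written to run on Python 3 as well as Python 2.

```python
MAX_PATTERN_LENGTH = 3

# Hypothetical lookup table standing in for test(): tuple of tokens -> tag.
PATTERNS = {
    ("Nexium",): "MEDICINE",
    ("pain",): "SYMPTOM",
    ("Barium", "Swallow"): "INTERVENTION",
    ("Swallow", "Test"): "INTERVENTION",
    ("pain", "in", "stomach"): "SYMPTOM",
}


def match_iterative(tokens):
    """Greedy left-to-right scan, trying the longest window first."""
    result = []
    i = 0
    n = len(tokens)
    while i < n:
        for size in range(min(MAX_PATTERN_LENGTH, n - i), 0, -1):
            window = tokens[i:i + size]
            tag = PATTERNS.get(tuple(window))
            if tag:
                break
        else:
            # No pattern of any length starts here: emit one untagged token.
            size, window, tag = 1, tokens[i:i + 1], "O"
        result.append((window, tag))
        i += size
    return result


if __name__ == "__main__":
    sentence = ("I went to a clinic to do a Barium Swallow Test because "
                "I had pain in stomach after taking Nexium").split()
    print(" ".join(t + "/" + tag
                   for window, tag in match_iterative(sentence)
                   for t in window))
```

The dictionary lookup replaces the chain of comparisons in test with one hash lookup per window, and the loop avoids one recursive call and one list concatenation per token.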
Answer 1 (score: 0)
Your solution is not very elegant. Consider using htql.RegEx from htql.net. Here is a partial solution to your problem:
tokens = "I went to a clinic to do a Barium Swallow Test because I had pain in stomach after taking Nexium".split()
symptoms = ['Nexium', 'pain', 'Barium Swallow', 'Swallow Test', 'pain in stomach']
import htql
a=htql.RegEx()
a.setNameSet('symptoms', symptoms)
a.reSearchList(tokens, '&[ws:symptoms]')
# [['Barium', 'Swallow'], ['pain', 'in', 'stomach'], ['Nexium']]
a.reSearchList(tokens, '&[ws:symptoms]', useindex=True)
# [(8L, 2L), (14L, 3L), (19L, 1L)]
You can easily extend this to more complex scenarios.
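To get from the index form of the result back to the output format the question asks for, the spans can be merged with the unmatched tokens. This is a minimal pure-Python sketch that does not call htql at all; it assumes the spans are non-overlapping (start, length) pairs like those printed above. The helper name spans_to_output and its single tag argument are hypothetical, since the search shown here does not say which tag each match should get.

```python
def spans_to_output(tokens, spans, tag):
    """Turn (start, length) spans into [(token_list, tag), ...], tagging the rest 'O'."""
    output = []
    pos = 0
    for start, length in sorted(spans):
        # Tokens between the previous match and this one are unmatched.
        output.extend(([t], "O") for t in tokens[pos:start])
        output.append((tokens[start:start + length], tag))
        pos = start + length
    # Trailing tokens after the last match are unmatched as well.
    output.extend(([t], "O") for t in tokens[pos:])
    return output
```

Called once per name set (for example one set per tag), this would give each group of matches its own tag; merging the results of several calls is left out of the sketch.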