Sequential pattern matching algorithm in Python

Asked: 2013-10-13 07:23:35

Tags: python algorithm nlp

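I find myself needing to implement an algorithm for sequential pattern matching in Python. After hours of searching, I could not find any working library or snippet on the Internet.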

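Problem definition:

Implement a function sequential_pattern_match

  Input: tokens (an ordered collection of strings)

  Output: a list of tuples, each tuple = (any subset of tokens, tag)

Domain experts will define the matching rules, usually using regular expressions:

  test(tokens) -> tag or None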

Example:

  Input: ["Singapore", "Python", "User", "Group", "is", "here"]

  Output: [(["Singapore", "Python", "User", "Group"], "ORGANIZATION"), ("is", 'O'), ("here", 'O')]

'O' means no match.
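A rule function of the kind described above could be sketched as follows. This is only an illustration in Python 3: the `RULES` table, its patterns and tags, and the name `rule_test` (which plays the role of `test(tokens)`) are assumptions, not part of the original problem statement.

```python
import re

# Hypothetical rule table: (compiled pattern, tag). The patterns and tags
# are illustrative only; a domain expert would supply the real ones.
RULES = [
    (re.compile(r"Singapore Python User Group"), "ORGANIZATION"),
    (re.compile(r"Singapore"), "LOCATION"),
    (re.compile(r"Python"), "LANGUAGE"),
]

def rule_test(tokens):
    """Return the tag of the first rule matching the whole phrase, else None."""
    phrase = " ".join(tokens)
    for pattern, tag in RULES:
        # fullmatch requires the pattern to cover the entire phrase.
        if pattern.fullmatch(phrase):
            return tag
    return None

print(rule_test(["Singapore", "Python", "User", "Group"]))
print(rule_test(["Singapore"]))
print(rule_test(["is"]))
```

Joining the tokens with a single space keeps the rules readable as plain phrases, at the cost of assuming that no token contains a space itself.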

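Conflict resolution rules:

  1. A match that appears first has higher priority. For example, in "Singapore property sales", if there are two conflicting matches, "Singapore property" as ASSET and "property sales" as EVENT, the first one is used.
  2. A longer match takes precedence over a shorter one. For example, "Singapore Python User Group" as ORGANIZATION takes precedence over "Singapore" as LOCATION + "Python" as LANGUAGE.

Based on my knowledge of algorithms and data structures, here is my implementation: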

    from itertools import ifilter, imap
    
    
    MAX_PATTERN_LENGTH = 3
    
    def test(tokens):
        length = len(tokens)
        if (length == 1):
            if tokens[0] == "Nexium":
                return "MEDICINE"
            elif tokens[0] == "pain":
                return "SYMPTOM"
        elif (length == 2):
            string = ' '.join(tokens)
            if string == "Barium Swallow":
                return "INTERVENTION"
            elif string == "Swallow Test":
                return "INTERVENTION"
        else:
            if ' '.join(tokens) == "pain in stomach":
                return "SYMPTOM"
    
    def _evaluate(tokens):
        tag = test(tokens)
        if tag:
            return (tokens, tag)
        elif len(tokens) == 1:
            return (tokens, 'O')
    
    def _splits(tokens):
        return ((tokens[:i], tokens[i:]) for i in xrange(min(len(tokens), MAX_PATTERN_LENGTH), 0, -1))
    
    def sequential_pattern_match(tokens):
        return ifilter(bool, imap(_halves_match, _splits(tokens))).next()
    
    def _halves_match(halves):
        result = _evaluate(halves[0])
        if result:
            return [result] + (halves[1] and sequential_pattern_match(halves[1]))
    
    if __name__ == "__main__":
        tokens = "I went to a clinic to do a Barium Swallow Test because I had pain in stomach after taking Nexium".split()
        output = sequential_pattern_match(tokens)
        slashTags = ' '.join(t + '/' + tag for tokens, tag in output for t in tokens)
        print(slashTags)
        assert slashTags == "I/O went/O to/O a/O clinic/O to/O do/O a/O Barium/INTERVENTION Swallow/INTERVENTION Test/O because/O I/O had/O pain/SYMPTOM in/SYMPTOM stomach/SYMPTOM after/O taking/O Nexium/MEDICINE"
    
        import timeit
        t = timeit.Timer(
            'sequential_pattern_match("I went to a clinic to do a Barium Swallow Test because I had pain in stomach after taking Nexium".split())',
            'from __main__ import sequential_pattern_match'
        )
        print(t.repeat(3, 10000))
    

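I don't think it could be much faster. Unfortunately, it is written in a functional style, which may not be a good fit for Python. Can you produce a faster implementation in an OO or imperative style?

(Note: I am sure it would be faster if implemented in C, but I currently have no plans to use any language other than Python.)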

2 answers:

Answer 0: (score: 1)

def sequential_pattern_match(tokens):
    for first, rest in _splits(tokens):
        x = _halves_match(first, rest)
        if x:
            return x

def _splits(tokens):
    for i in xrange(min(len(tokens), MAX_PATTERN_LENGTH), 0, -1):
        yield tokens[:i], tokens[i:]

def _halves_match(first, rest):
    tag = test(first)
    if tag:
        return [(first, tag)] + (rest and sequential_pattern_match(rest))

def test(tokens):
    length = len(tokens)
    if length == 1:
        if tokens[0] == "Nexium":
            return "MEDICINE"
        elif tokens[0] == "pain":
            return "SYMPTOM"
        else:
            return "O"
    elif length == 2:
        if tokens == ["Barium", "Swallow"]:
            return "INTERVENTION"
        elif tokens == ["Swallow", "Test"]:
            return "INTERVENTION"
    elif tokens == ["pain", "in", "stomach"]:
        return "SYMPTOM"

Replaced ifilter and imap with a simple for loop. Replaced the generator expression with a for/yield loop.

Timings on my machine improved:

  • 1.02694065435 -> 0.708227394544 (Python 2.7.5)
  • 1.1575780184 -> 0.425939527209 (PyPy 2.1)
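
Since every single token ends up with a tag ('O' at worst), the recursion never needs to backtrack: the first split that matches always leads to a complete result. Under that assumption the same tagging can be produced with a plain loop, which also avoids deep recursion on long inputs. The following is a minimal Python 3 sketch, not a benchmarked replacement; the `RULES` dictionary and the name `greedy_match` are hypothetical stand-ins for the `test` function and `sequential_pattern_match`.

```python
MAX_PATTERN_LENGTH = 3

# Hypothetical lookup table mirroring the rules of test() in the question.
RULES = {
    ("Nexium",): "MEDICINE",
    ("pain",): "SYMPTOM",
    ("Barium", "Swallow"): "INTERVENTION",
    ("Swallow", "Test"): "INTERVENTION",
    ("pain", "in", "stomach"): "SYMPTOM",
}

def greedy_match(tokens):
    """Tag tokens left to right, trying the longest candidate span first."""
    result = []
    i = 0
    while i < len(tokens):
        longest = min(MAX_PATTERN_LENGTH, len(tokens) - i)
        for size in range(longest, 0, -1):
            span = tokens[i:i + size]
            tag = RULES.get(tuple(span))
            if tag:
                break
        else:
            # No rule matched at this position: emit one untagged token.
            span, tag = tokens[i:i + 1], "O"
        result.append((span, tag))
        i += len(span)
    return result

sentence = ("I went to a clinic to do a Barium Swallow Test because "
            "I had pain in stomach after taking Nexium")
output = greedy_match(sentence.split())
print(" ".join(t + "/" + tag for span, tag in output for t in span))
```

Scanning from the left gives earlier matches priority, and trying sizes from `MAX_PATTERN_LENGTH` downwards gives longer matches priority, so "Barium Swallow" wins over "Swallow Test" here.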

Answer 1: (score: 0)

Your solution is not very elegant. Consider using htql.RegEx from htql.net. Here is a partial solution to your problem:

tokens = "I went to a clinic to do a Barium Swallow Test because I had pain in stomach after taking Nexium".split()
symptoms = ['Nexium', 'pain', 'Barium Swallow', 'Swallow Test', 'pain in stomach']

import htql
a=htql.RegEx()
a.setNameSet('symptoms', symptoms)

a.reSearchList(tokens, '&[ws:symptoms]')
# [['Barium', 'Swallow'], ['pain', 'in', 'stomach'], ['Nexium']]

a.reSearchList(tokens, '&[ws:symptoms]', useindex=True)
# [(8L, 2L), (14L, 3L), (19L, 1L)]

You can easily extend it to more complex scenarios.
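
To get from the index output above to the list of tuples the question asks for, the unmatched tokens still have to be filled in with 'O'. A rough Python 3 sketch, assuming each pair is (start index, length), which is consistent with the example output, and that the matches do not overlap. The helper `fill_gaps` and the placeholder tag "MATCH" are hypothetical and not part of htql; the tag is passed in by the caller because the output shown carries no per-match label.

```python
def fill_gaps(tokens, spans, tag):
    """Turn (start, length) matches into a full tagging, using 'O' elsewhere."""
    result = []
    pos = 0
    for start, length in sorted(spans):
        # Tokens between the previous match and this one are unmatched.
        result.extend(([tok], "O") for tok in tokens[pos:start])
        result.append((tokens[start:start + length], tag))
        pos = start + length
    # Tokens after the last match are unmatched as well.
    result.extend(([tok], "O") for tok in tokens[pos:])
    return result

tokens = ("I went to a clinic to do a Barium Swallow Test because "
          "I had pain in stomach after taking Nexium").split()
spans = [(8, 2), (14, 3), (19, 1)]  # index pairs copied from the output above
tagged = fill_gaps(tokens, spans, "MATCH")
print(tagged[8:10])
```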