Python: a parser for checking the syntax of a formatted string

Time: 2015-09-01 07:58:42

Tags: python regex parsing

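I am just starting out with Python and, for a design I am working on, I need to validate a string that must have one of these formats: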

aaa...a   aaa...a(bbb...b)   aaa...a(bbb...b)ccc...c   aaa...a(bbb...b)ccc...c(ddd...d)

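where aaa...a, bbb...b, ccc...c and ddd...d are integers.

The string can be of arbitrary length.

The string contains no spaces.

Only a single level of brackets is allowed.

I have solved this problem with a finite state machine that has two states.

I would like to know whether there is a better way to solve this task, and I would welcome your impressions and any hints.

As a side note, I ran some tests with regular expressions, but this looks to me like a recursive pattern validation problem, and I am not sure that can be done easily in Python. I am no regex expert, though; I suppose that if the task is feasible with a regex, it could be done in a single line of code.

The main advantage I can see in the FSM approach is that it can tell the user where in the input string the error is, which makes checking and correcting the string easier from the user's point of view.

[EDIT] I found a wrong detection behaviour; the code is now corrected so that two consecutive bracket groups, e.g. 10(200)(300), are not allowed. I have also reformatted the code into a function.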


"""

String parser for string formatted as reported below:

aaa...a
aaa...a(bbb...b)
aaa...a(bbb...b)ccc...c(ddd...d)

where:
aaa...a, bbb...b = integer number

Not valid (some example)
()
(aaa...a)
aaa...a()
aaa...a(bbb...b)ccc...d
aaa...a((bbb....b))
"""

import sys
import re

def parse_string(buffer):
    # Checking loop
    state = 1
    old_state = 1
    next_state = 1
    strlen = len(buffer)
    # Stays -1 for an empty buffer, so the final return never reads an unbound name
    index = -1
    initial = True
    success = False
    is_a_number = re.compile("[0-9]")
    for index, i in enumerate(buffer):

        car = i

        # State 1
        if (state == 1):
            if is_a_number.match(car):
                if (index != strlen-1):
                    # A digit that is not the last char: wait for the next char, "(" or another digit
                    next_state = 1
                else:
                    if (initial):
                        # A digit that is also the last char of the initial block -> parsing is finished
                        success = True
                        break
                    else:
                        # The last char is a digit but lies outside the initial block of digits -> error
                        success = False
                        break
            else:
                if (car == "("):
                    if (old_state == 2):
                        # Can't have two (...)(...) consecutively
                        success = False
                        break
                    if ((index == 0) or (index == strlen-1)):
                        # The ( can't be the first or the last char
                        success = False
                        break
                    else:
                        # Step to the next state
                        next_state = 2
                        initial = False
                else:
                    # Wrong char detected
                    success = False
                    break

        if (state == 2):
            if is_a_number.match(car):
                if (index != strlen-1):
                    # The char is a number and is not the last of the string
                    next_state = 2
                else:
                    # A digit that is also the last char is an error: the closing ")" is missing
                    success = False
                    break
            else:
                if (car == ")"):
                    if (old_state == 1):
                        # The sequence () is not allowed
                        success = False
                        break
                    elif ((old_state == 2) and (index != strlen-1)):
                        # The previous char was a number
                        next_state = 1
                    else:
                        # I'm on the last char of the string
                        success = True
                        break
                else:
                    # Wrong char detected
                    success = False
                    break

        print("current state: "+ str(state) + " next_state: " + str(next_state))

        # Update the old and the new state
        old_state = state
        state = next_state

    return(success, state, index)

if __name__ == "__main__":

    # Get the string from the command line
    # The first argument (index = 0) is the script name; the supplied parameters start at index = 1
    number_cmd = len(sys.argv) - 1
    if (number_cmd != 1):
        print("Error: exactly one string is required as input!")
        # Exit with a non-zero status so callers can detect the usage error
        sys.exit(1)

    # Get the string
    buffer = sys.argv[1].strip()

    print("================================")
    print("Parsing: " + buffer)
    print("Checking with fsm")
    print("--------------------------------")

    # Parse the string
    success, state, index = parse_string(buffer)

    # Check result
    if (success):
        print("String validated!")
        print("================================")
    else:
        print("Syntax error detected in state: " + str(state) + "\n" + "position: " + str(buffer[:index+1]))
        print("================================")

    # Exit from script
    sys.exit(0)

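2 answers:

Answer 0: (score: 2)

Finite state machines and regular expressions are equivalent in expressive power: both can be used to parse regular languages. So if your problem can be solved with an FSM, it can also be solved with a regular expression.

If nested brackets such as 1(123(345)12) were allowed, the language would not be regular, and neither an FSM nor a regular expression could parse the string. From your description and your script, though, I assume nested brackets are not allowed, so a regular expression will work.

Your requirements: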
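As a minimal sketch of that equivalence, assuming the accepted language is exactly what the question's state machine accepts (a run of digits, optionally followed by one bracketed run of digits and then any number of further `digits(digits)` pairs), a single pattern can give the yes/no answer. The names `PATTERN` and `is_valid` are made up for this illustration, and the sketch does not report where an error is.

```python
import re

# Assumed grammar: digits, then optionally "(digits)" followed by
# zero or more "digits(digits)" pairs. No nesting, no empty brackets.
PATTERN = re.compile(r"[0-9]+(?:\([0-9]+\)(?:[0-9]+\([0-9]+\))*)?")

def is_valid(s):
    """Return True if the whole string matches the assumed grammar."""
    return PATTERN.fullmatch(s) is not None

samples = ["10", "10(200)", "10(200)30(400)", "10(200)(300)", "10()", "(10)", "10((20))"]
for sample in samples:
    print(sample, is_valid(sample))
```

Only the first three samples are accepted; the consecutive groups, the empty brackets, the leading bracket and the nested brackets are all rejected.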

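  1. Parse the string and return whether it is valid.
  2. If the string is invalid, print the position of the error.
  3. The string cannot start with '(', and empty brackets '()' are not allowed.

To get a precise error position, you cannot use one regular expression that matches the whole string. Instead, you can split the string with the regular expression \(|\) and match each segment against [0-9]+. After that you only need to make sure the parentheses are balanced.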
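It may help to see what such a split returns. The short sketch below wraps the pattern in a capturing group so that the parentheses are kept as segments of their own; the sample input is made up for illustration.

```python
import re

# The capturing group keeps each parenthesis as its own list element.
segments = re.split(r"(\(|\))", "12(345)67(8)")
print(segments)
# ['12', '(', '345', ')', '67', '(', '8', ')', '']
```

Note the empty string at the end: when the input ends with a parenthesis, `re.split` still emits the (empty) text after it, so any loop over the segments has to tolerate that trailing element.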

Here is my script:

    import re
    
    def parse_input(s):
        s = s.strip()
        digits = re.compile(r"[0-9]+")
        # Split on parentheses; the capturing group keeps them as segments
        segments = re.split(r"(\(|\))", s)
    
        if not s:  # re.split never returns an empty list, so test the string itself
            print("Error: blank input")
            return False
        if not segments[0]:  # opens with a parenthesis
            print("Error: cannot open with parenthese")
            return False
    
        in_p = False  # True while inside an open parenthesis
    
        def get_error_context(i):
            prefix = segments[i-1] if i > 0 else ""
            suffix = segments[i+1] if i < len(segments)-1 else ""
            return prefix + segments[i] + suffix
    
        for i, segment in enumerate(segments):
            if not segment:
                if in_p:  # blank is not allowed within parentheses
                    print("Error: empty parentheses not allowed, around '%s'" % get_error_context(i))
                    return False
                elif i == len(segments) - 1:
                    # re.split yields a trailing '' when the input ends with ")"
                    continue
                else:
                    print("Error: no digits between ) and (, around '%s'" % get_error_context(i))
                    return False
            elif segment == "(":
                if in_p:
                    print("Error: recursive () not allowed, around '%s'" % get_error_context(i))
                    return False
                else:
                    in_p = True
            elif segment == ")":
                if in_p:
                    in_p = False
                else:
                    print("Error: ) with no matching (, around '%s'" % get_error_context(i))
                    return False
            elif not digits.fullmatch(segment):  # the whole segment must be digits
                print("Error: non digits, around '%s'" % get_error_context(i))
                return False
        if in_p:
            print("Error: input ends with an open parenthese, around '%s'" % get_error_context(i))
            return False
        return True
    

Tests:

    >>> parse_input("12(345435)4332(34)")
    True
    >>> parse_input("(345435)4332(34)")
    Error: cannot open with parenthese
    False
    >>> parse_input("sdf(345435)4332()")
    Error: non digits, around 'sdf('
    False
    >>> parse_input("123(345435)4332()")
    Error: empty parentheses not allowed, around '()'
    False
    >>> parse_input("34324(345435)(34)")
    Error: no digits between ) and (, around ')('
    False
    >>> parse_input("123(344332()")
    Error: recursive () not allowed, around '344332('
    False
    >>> parse_input("12)3(3443)32(123")
    Error: ) with no matching (, around '12)3'
    False
    >>> parse_input("123(3443)32(123")
    Error: input ends with an open parenthese, around '(123'
    False
    

Answer 1: (score: 0)

This can be done with a regular expression. Here is an example in Python; you can also try it on regex101.

Regular expression: (\d+)(\(\d+\)(\d+(\(\d+\))?)?)?

The Python code would be:

import re
# A plain raw string works here; the ur'' prefix is a syntax error in Python 3
p = re.compile(r'(\d+)(\(\d+\)(\d+(\(\d+\))?)?)?')
test_str = "1000(20)30(345)"

re.match(p, test_str)

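If you want to make sure that nothing else comes before or after an input such as 1000(20)30(345), add ^ at the start of the regular expression and $ at the end.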
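A small sketch of the difference, using the pattern above with a made-up invalid input (`bad`): without anchors, `re.match` is satisfied by a valid prefix, while the anchored pattern rejects the string. `re.fullmatch` is shown as an alternative way to require that the whole string matches.

```python
import re

# Pattern from the answer, written as a plain raw string
pattern = r'(\d+)(\(\d+\)(\d+(\(\d+\))?)?)?'

bad = "1000(20)30(345)xyz"

# Unanchored: re.match only needs a valid prefix, so this still matches
prefix = re.match(pattern, bad)
print(prefix.group(0))   # 1000(20)30(345)

# Anchored: the whole string must fit the pattern, so the result is None
print(re.match('^' + pattern + '$', bad))

# re.fullmatch requires a whole-string match without explicit anchors
print(re.fullmatch(pattern, "1000(20)30(345)") is not None)   # True
```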