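Determine whether a line has unclosed parentheses or quotes in Python

Time: 2012-08-09 19:19:36

Tags: python regex string

I'm looking for a simple way to open a file and search each line to see whether it has unclosed parens or quotes. If a line has unclosed parens/quotes, I want to write that line to a file. I know I could do it with an ugly block of if/for statements, but I suspect Python has a better way, with the re module (which I know nothing about) or something else, and I don't know the language well enough to do it.

Thanks!

Edit: some sample lines. It may be easier to read if you copy this into Notepad or similar and turn off word wrap (some lines can be quite long). Also, the file has over 100k lines, so something efficient would be great!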

SL  ID=0X14429A0B TY=STANDARD OWN=0X429A03 EXT=22 SLTK=0X1C429A0B MP=0X684003F0 SUB=0X24400007
RT  ID=0X18429A19 TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0C CMDS=(N:0X8429A04,C:0X14429A0B) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
RT  ID=0X18429A1A TY=CALONSC OWN=0X14429A0B EXLP=0X14429A08 CMDS=(R:0X8429A04,N:0X8429A05,C:0X14429A0B) SGCC=2 REL=2 DESC="AURANT YD TO TRK.1" ATIS=T
RT  ID=0X18429A1B TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0A CMDS=(R:0X8429A04,R:0X8429A05,C:0X14429A0B) SGCC=2 REL=3 DESC="AURANT YD TO TRK.2" ATIS=T
SL  ID=0X14429A0C TY=STANDARD OWN=0X429A03 EXT=24 SLTK=0X1C429A0B MP=0X684003F1 SUB=0X24400007
RT  ID=0X18429A1C TY=CALONSC OWN=0X14429A0C EXLP=0X14429A0B CMDS=(N:0X8429A04,C:0X14429A0C) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
TK  ID=0X1C429A08 TY=BLKTK OWN=0X429A03 EXT=12 LRMP=0X6C40BDAF LEN=5837 FSPD=60 PSPD=65 QUAL=TRK.1 MAXGE=0 MAXGW=0 JAL=4 ALT=12 SUB=0X24400007 RULES=(CTC:B:UP:0X24400007:485.7305:486.8359:T) LLON=-118.1766772 RLON=-118.1620059 LLAT=34.06838375 RLAT=34.07811764 LELE=416.6983 RELE=425.0596 ULAD=NO URAD=NO
PT  ID=0X20429A0F TY=STANDARD OWN=0X1C429A08 LTK=0X1C40006C RTK=0X1C429A0C REL=1 LEN=1 LQUAL="TRK.1" RQUAL="TRK.1"
PTK OWN=0X1C429A08 PID=0X1C429A13

8 Answers:

Answer 0: (score: 6)

If you don't expect mismatched parens in reverse order (i.e. ")("), you can do this:

with open("myFile.txt","r") as readfile, open("outFile.txt","w") as outfile:
    for line in readfile:
        if line.count("(") != line.count(")") or line.count('"') % 2 != 0:
            outfile.write(line)

Otherwise you have to count one character at a time to check for mismatches, like this:

with open("myFile.txt","r") as readfile, open("outFile.txt","w") as outfile:
    for line in readfile:
        count = 0
        for char in line:
            if char == ")":
                count -= 1
            elif char == "(":
                count += 1
            if count < 0:
                break
        if count != 0 or line.count('"') % 2 != 0:
            outfile.write(line)

I can't think of a better way to handle it. Python's re module doesn't support recursive regular expressions, so a pure regex solution is out.
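
To see why the running count matters, here is a minimal sketch that compares the two checks on a line whose parens are in the wrong order. The function names `count_check` and `running_check` and the sample line are made up for this illustration.

```python
def count_check(line):
    """Flag a line by comparing totals only (the idea of the first snippet)."""
    return line.count("(") != line.count(")") or line.count('"') % 2 != 0


def running_check(line):
    """Flag a line as soon as a ")" shows up with nothing open (second snippet)."""
    depth = 0
    for char in line:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                break  # a ")" with no "(" before it
    return depth != 0 or line.count('"') % 2 != 0


backwards = "CMDS=)N:0X8429A04( SGCC=2"
print(count_check(backwards))    # False: one "(" and one ")", so the totals agree
print(running_check(backwards))  # True: the ")" comes first
```

The totals-only check is cheaper to write, but it only works if the data never contains a close paren before its open paren.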

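One more thing about this: given your data, it may be better to put the check into a function and split each line into its values, which is easy to do with a regex, like so: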

import re

# captures the value after each "=", up to the next KEY= or the end of the line
splitre = re.compile(r".*?=(.*?)(?:(?=\s*?\S*?=)|(?=\s*$))")

def matchParens(text):
    # returns True when text has unbalanced parens or an odd number of quotes
    count = 0
    for char in text:
        if char == ")":
            count -= 1
        elif char == "(":
            count += 1
        if count < 0:
            break
    return count != 0 or text.count('"') % 2 != 0

with open("myFile.txt","r") as readfile, open("outFile.txt","w") as outfile:
    for line in readfile:
        if any(matchParens(text) for text in splitre.findall(line)):
            outfile.write(line)

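The reason this may be better is that it checks each value on its own: if one value has an open paren and a later value has the close paren, the line is still reported as unbalanced instead of passing as balanced.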
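
As a rough illustration of that difference, the sketch below runs the same pattern on a made-up line where the "(" opens in one value and the ")" closes in the next. The helper name `unbalanced` and the sample line are assumptions for this example, not part of the original data.

```python
import re

# Same pattern as in the answer, written as a raw string.
splitre = re.compile(r".*?=(.*?)(?:(?=\s*?\S*?=)|(?=\s*$))")


def unbalanced(text):
    """Return True if text has unmatched parens or an odd number of double quotes."""
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                break
    return depth != 0 or text.count('"') % 2 != 0


line = "A=(1 B=2)"  # made-up line: "(" is in A's value, ")" is in B's value
values = splitre.findall(line)
print(values)                               # ['(1', '2)']
print(unbalanced(line))                     # False: the whole line looks balanced
print(any(unbalanced(v) for v in values))   # True: each value on its own does not
```

Whether a paren that spans two values should count as an error depends on the file format, so this stricter check is a choice rather than a requirement.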

Answer 1: (score: 5)

Using a parser package may seem like overkill, but it's fast:

text = """\
SL  ID=0X14429A0B TY=STANDARD OWN=0X429A03 EXT=22 SLTK=0X1C429A0B MP=0X684003F0 SUB=0X24400007
RT  ID=0X18429A19 TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0C CMDS=(N:0X8429A04,C:0X14429A0B) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
RT  ID=0X18429A1A TY=CALONSC OWN=0X14429A0B EXLP=0X14429A08 CMDS=(R:0X8429A04,N:0X8429A05,C:0X14429A0B) SGCC=2 REL=2 DESC="AURANT YD TO TRK.1" ATIS=T
RT  ID=0X18429A1B TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0A CMDS=(R:0X8429A04,R:0X8429A05,C:0X14429A0B) SGCC=2 REL=3 DESC="AURANT YD TO TRK.2" ATIS=T
SL  ID=0X14429A0C TY=STANDARD OWN=0X429A03 EXT=24 SLTK=0X1C429A0B MP=0X684003F1 SUB=0X24400007
RT  ID=0X18429A1C TY=CALONSC OWN=0X14429A0C EXLP=0X14429A0B CMDS=(N:0X8429A04,C:0X14429A0C) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
TK  ID=0X1C429A08 TY=BLKTK OWN=0X429A03 EXT=12 LRMP=0X6C40BDAF LEN=5837 FSPD=60 PSPD=65 QUAL=TRK.1 MAXGE=0 MAXGW=0 JAL=4 ALT=12 SUB=0X24400007 RULES=(CTC:B:UP:0X24400007:485.7305:486.8359:T) LLON=-118.1766772 RLON=-118.1620059 LLAT=34.06838375 RLAT=34.07811764 LELE=416.6983 RELE=425.0596 ULAD=NO URAD=NO
PT  ID=0X20429A0F TY=STANDARD OWN=0X1C429A08 LTK=0X1C40006C RTK=0X1C429A0C REL=1 LEN=1 LQUAL="TRK.1" RQUAL="TRK.1"
PTK OWN=0X1C429A08 PID=0X1C429A13 GOOD
PTK OWN=0X1C429A(08 PID=0X1C429A13 BAD
PTK OWN=0X1C429A08 )PID=0X1C429A13 BAD
PTK OWN=0X1C(42(9A))08 PID=0X1C429A13 GOOD
PTK OWN=0X1C(42(9A))08 PID=0X1C42(9A13 BAD
PTK OWN=0X1C(42(9A))08 PID=0X1C42"("9A13 GOOD
"""

from pyparsing import nestedExpr, quotedString

paired_exprs = nestedExpr('(',')')  |  quotedString

for i, line in enumerate(text.splitlines(), start=1):
    # use pyparsing expression to strip out properly nested quotes/parentheses
    stripped_line = paired_exprs.suppress().transformString(line)

    # if there are any quotes or parentheses left, they were not 
    # properly nested
    if any(unwanted in stripped_line for unwanted in '()"\''):
        print(i, ':', line)

Prints:

10 : PTK OWN=0X1C429A(08 PID=0X1C429A13 BAD
11 : PTK OWN=0X1C429A08 )PID=0X1C429A13 BAD
13 : PTK OWN=0X1C(42(9A))08 PID=0X1C42(9A13 BAD

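Answer 2: (score: 3)

  1. Extract all of the interesting symbols from a line.
  2. Push each opening symbol onto a stack, and pop from the stack each time you get a closing symbol.
  3. If the stack ends up empty, the symbols are balanced. If the stack underflows or is not fully unwound, the line is unbalanced.
  4. Sample code that checks lines is below. I inserted a stray parenthesis in the first line.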

    d = """SL  ID=0X14429A0B TY=STANDARD OWN=0X429A(03 EXT=22 SLTK=0X1C429A0B MP=0X684003F0 SUB=0X24400007
    RT  ID=0X18429A19 TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0C CMDS=(N:0X8429A04,C:0X14429A0B) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
    RT  ID=0X18429A1A TY=CALONSC OWN=0X14429A0B EXLP=0X14429A08 CMDS=(R:0X8429A04,N:0X8429A05,C:0X14429A0B) SGCC=2 REL=2 DESC="AURANT YD TO TRK.1" ATIS=T
    RT  ID=0X18429A1B TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0A CMDS=(R:0X8429A04,R:0X8429A05,C:0X14429A0B) SGCC=2 REL=3 DESC="AURANT YD TO TRK.2" ATIS=T
    SL  ID=0X14429A0C TY=STANDARD OWN=0X429A03 EXT=24 SLTK=0X1C429A0B MP=0X684003F1 SUB=0X24400007
    RT  ID=0X18429A1C TY=CALONSC OWN=0X14429A0C EXLP=0X14429A0B CMDS=(N:0X8429A04,C:0X14429A0C) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
    TK  ID=0X1C429A08 TY=BLKTK OWN=0X429A03 EXT=12 LRMP=0X6C40BDAF LEN=5837 FSPD=60 PSPD=65 QUAL=TRK.1 MAXGE=0 MAXGW=0 JAL=4 ALT=12 SUB=0X24400007 RULES=(CTC:B:UP:0X24400007:485.7305:486.8359:T) LLON=-118.1766772 RLON=-118.1620059 LLAT=34.06838375 RLAT=34.07811764 LELE=416.6983 RELE=425.0596 ULAD=NO URAD=NO
    PT  ID=0X20429A0F TY=STANDARD OWN=0X1C429A08 LTK=0X1C40006C RTK=0X1C429A0C REL=1 LEN=1 LQUAL="TRK.1" RQUAL="TRK.1"
    PTK OWN=0X1C429A08 PID=0X1C429A13"""
    
    def unbalanced(line):
        close_symbols = {'"' : '"', '(': ")", '[': ']', "'" : "'"}
        syms = [x for x in line if x in '\'"[]()']
        stack = []
        for s in syms:
            try:
                if len(stack) > 0 and s == close_symbols[stack[-1]]:
                    stack.pop()
                else:
                    stack.append(s)
            except KeyError: # top of the stack is a closing symbol, which can never be matched
                return True
        return len(stack) != 0
    
    
    print(unbalanced("hello 'there' () []"))
    print(unbalanced("hello 'there\"' () []"))
    print(unbalanced("]["))
    
    lines = d.splitlines()  # in your case you can do open("file.txt").readlines()
    
    print([line for line in lines if unbalanced(line)])
    

    For a large file you don't want to read the whole thing into memory, so use a snippet like this:

    with open("file.txt") as infile:
        for line in infile:
            if unbalanced(line):
                print(line)
    

Answer 3: (score: 1)

Regex: if your lines never contain nested parentheses, the solution is quite simple:

import re

for line in myFile:
    if re.search(r"\([^\(\)]*($|\()", line):
        # this line contains unbalanced parentheses.
        print(line)
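
A quick way to see what this pattern does and does not catch is to run it on a few lines. The sketch below uses made-up variations of one of the sample lines; as written, the pattern looks for a "(" that is never closed, so a stray ")" on its own is not flagged.

```python
import re

# Same pattern as above: a "(" followed by non-paren characters and then
# either the end of the line or another "(".
pattern = re.compile(r"\([^\(\)]*($|\()")

samples = [
    "CMDS=(N:0X8429A04,C:0X14429A0B) SGCC=2",  # balanced
    "CMDS=(N:0X8429A04,C:0X14429A0B SGCC=2",   # "(" never closed
    "CMDS=N:0X8429A04,C:0X14429A0B) SGCC=2",   # stray ")"
]
results = [bool(pattern.search(sample)) for sample in samples]
print(results)  # [False, True, False]
```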

If you have to deal with the possibility of nested parentheses, it gets a little more complicated:

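for line in myFile:
    paren_stack = []
    unbalanced = False
    for char in line:
        if char == '(':
            paren_stack.append(char)
        elif char == ')':
            if paren_stack:
                paren_stack.pop()
            else:
                # a ")" with no matching "("
                unbalanced = True
                break
    if unbalanced or paren_stack:
        # this line contains unbalanced parentheses.
        print(line)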

Answer 4: (score: 0)

I would do something like this:

for line in open(file, 'r'):
    if line.count('"') % 2 != 0 or line.count('(') != line.count(')'):
        print(line)

But I can't be sure this fits your needs exactly.

More robust:

for line in open(file, 'r'):
    paren_count = 0
    paren_count_start_quote = 0
    quote_open = False
    bad_quote = False
    for char in line:
        if char == ')':
            paren_count -= 1
        elif char == '(':
            paren_count += 1
        elif char == '"':
            quote_open = not quote_open
            if quote_open:
                # remember the paren depth at the opening quote
                paren_count_start_quote = paren_count
            elif paren_count != paren_count_start_quote:
                # parens inside the quotes did not balance
                bad_quote = True
                break
        if paren_count < 0:
            break
    # print each offending line only once
    if bad_quote or quote_open or paren_count != 0:
        print(line)

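I haven't tested the more robust version, but I think it should work. It now also handles a case like (")": closing a set of parens inside quotes prints the line.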

Answer 5: (score: 0)

Check this code:

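import io
from tokenize import generate_tokens, TokenError

def syntaxCheck(line):
    par, quo, dquo = 0, 0, 0
    # per-token deltas: (parens, single quotes, double quotes)
    count = {'(': (1, 0, 0), ')': (-1, 0, 0), "'": (0, 1, 0), '"': (0, 0, 1)}
    try:
        # generate_tokens expects a readline callable that returns str
        for tok in generate_tokens(io.StringIO(line).readline):
            countPar, countQuo, countDQuo = count.get(tok.string, (0, 0, 0))
            par += countPar
            quo ^= countQuo
            dquo ^= countDQuo
    except TokenError:
        # the tokenizer gives up on some malformed input; keep what was
        # counted so far (the exact behavior depends on the Python version)
        pass
    return par, quo, dquo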

Note that parentheses inside closed quotes are not counted, because the quoted text is a single string token.

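Answer 6: (score: -1)

Are the parens and quotes supposed to be closed on each line? If so, you can simply count the parens and the quotes. If the count is even, they match. If it is odd, one is missing. Put that logic into a function, dump the lines of the text file into an array, and call map to run the function on each string in the array.

My Python is rusty, but that's how I'd do it, assuming everything "should" be on the same line.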
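
A minimal sketch of that idea might look like the following. The function name `looks_unbalanced` and the three lines are made up for illustration; in practice the list would come from the file.

```python
def looks_unbalanced(line):
    """Return True if the parens or the double quotes come in an odd number."""
    paren_total = line.count("(") + line.count(")")
    quote_total = line.count('"')
    return paren_total % 2 != 0 or quote_total % 2 != 0


# Made-up lines standing in for the contents of the file.
lines = [
    'RT ID=0X18429A19 CMDS=(N:0X8429A04) DESC="AURANT YD"',
    'RT ID=0X18429A19 CMDS=(N:0X8429A04 DESC="AURANT YD"',
    'RT ID=0X18429A19 CMDS=(N:0X8429A04) DESC="AURANT YD',
]
flags = list(map(looks_unbalanced, lines))
print(flags)  # [False, True, True]
```

An even total is only a rough signal: a line such as "((" has an even number of parens and would pass this check even though nothing is closed.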

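Answer 7: (score: -1)

My solution may not be as fancy, but I'd say just count the number of parens and quotes. If the count doesn't come out even, you know something is missing!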