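Stripping blocks of text using Python

Time: 2016-12-04 23:35:29

Tags: python

As requested in another post, I am posting below the code I use to strip blocks of text from a file with the format shown further down. As listed previously, the problem I am trying to solve with this code is as follows: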

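  1. Parse blocks of text in a file (file1) using patterns created from another file (file2). Both file1 and file2 are supplied as command-line arguments.
  2. The logic for identifying a block of text is to count the '{' and '}' braces in that section (since a text block contains many braces). One thing to note is that there are 'Begin' and 'End' lines that enclose the block text (i.e. the braces). In my code I am trying to track both, because file1 sometimes may not have the 'Begin'/'End' lines, but the braces will always be present.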

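I need suggestions on how to improve this code for run time and for conciseness (code optimization). Note that file1 is a huge file with several hundred thousand lines, whereas file2 is small, roughly 100 lines. I have tried to add as many comments as possible in the code to make it easier to read.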
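Since file2 is small and file1 is huge, one possible saving is to test every line of file1 against a single precompiled matcher rather than looping over all patterns. The following is only a sketch under some assumptions: the patterns are treated as literal substrings (as the `in` test in the posted code does), blank lines in file2 are ignored, and the names `load_patterns`, `build_matcher` and `is_block_start` are hypothetical helpers, not part of the original code.

```python
import re


def load_patterns(lines):
    """Return the unique, non-empty patterns from an iterable of lines."""
    stripped = (line.rstrip("\r\n") for line in lines)
    return {p for p in stripped if p}


def build_matcher(patterns):
    """Compile all literal patterns into one alternation, or None if empty."""
    if not patterns:
        return None
    # Longest first, so a longer pattern is tried before its own prefix.
    ordered = sorted(patterns, key=len, reverse=True)
    return re.compile("|".join(re.escape(p) for p in ordered))


def is_block_start(line, matcher):
    """True for a 'Begin' line that mentions one of the patterns."""
    # The cheap substring test runs first, so the regex is only
    # evaluated on the comparatively rare 'Begin' lines.
    if matcher is None or "Begin " not in line:
        return False
    return matcher.search(line) is not None
```

Checking `'Begin ' in line` before the pattern test means that most of the several hundred thousand lines are rejected after one substring search.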

The format of file1 is shown below:

    /* Begin : abcxyz*/
    cell ("pattern1") {
    /* ---------------------------------------------------------------------- */
    /* Comment lines */
    /* ---------------------------------------------------------------------- */
    line 1
    line 2 {
    }
    line 3
    }
    /* End : abcxyz*/
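To illustrate the brace-counting idea on the sample above, here is a minimal sketch that tracks the nesting depth line by line. It assumes that a comment line starts with `/*` or `*` (slightly stricter than the regular expressions in the posted code) and that braces inside comment lines should not be counted. `COMMENT_LINE` and `brace_delta` are hypothetical names introduced for this example.

```python
import re

# Assumption: a comment line begins with '/*' or '*' after optional whitespace.
COMMENT_LINE = re.compile(r"\s*(?:/\*|\*)")


def brace_delta(line):
    """Net change in brace depth for one line; comment lines contribute 0."""
    if COMMENT_LINE.match(line):
        return 0
    return line.count("{") - line.count("}")


sample = """\
/* Begin : abcxyz*/
cell ("pattern1") {
/* ---------------------------------------------------------------------- */
/* Comment lines */
/* ---------------------------------------------------------------------- */
line 1
line 2 {
}
line 3
}
/* End : abcxyz*/
"""

depth = 0
depths = []
for line in sample.splitlines():
    depth += brace_delta(line)
    depths.append(depth)

print(depths)  # [0, 1, 1, 1, 1, 1, 2, 1, 1, 0, 0]
```

The depth returns to 0 on the closing `}` of the cell, which is the point where the block ends even when the 'End' line is missing.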
    

The actual code is listed below:

    import sys # Module to work with argv parameters
    import re # Module to work with regular expressions
    
    in_file = sys.argv[1] # Setting the first argument as file1.
    pattern_file = sys.argv[2] # Setting the second argument as file2 (containing the patterns to be parsed in file1).
    
    patterns = [] # Creating a empty list for populating the pattern details.
    with open (pattern_file, 'r') as file2: # Opening the file2 in read mode.
        for pattern in sorted(set(file2.readlines())): # Sorting and making the pattern list unique while reading each pattern.
            patterns.append(pattern.rstrip('\n')) # Stripping the newline character and building the pattern array set.
    
    out_flag = False
    forward_brace = 0
    backward_brace = 0
    scope_count = 0
    out_file = open ("file3", 'w')
    with open (in_file, 'r') as in_lib: # Opens the input file(file1) for reading.
        for line in in_lib: # Iterates over file1 line by line instead of loading the whole file into a list
            # Creates a generator expression; tries to get the first match from the file1 based on the pattern list
            # Once the begin block is found, the flag to write the output file is set to True
            if any(p in line for p in patterns) and 'Begin ' in line:
                forward_brace = backward_brace = 0
                out_flag = True
            # For all the lines other than 'Begin' statement, the brace count is calculated.
            # The brace count is kept track for determining the scope of the cell block.
            else:
                # Matches any line starting with '/*' or '*' to avoid counting the brace for scope determination.
                if any(re.match(r, line) for r in [r'^\s*/', r'^\s*[*]']):
                    out_file.write(line)
                    continue
                else:
                    forward_brace += line.count('{')
                    backward_brace += line.count('}')
                    scope_count = forward_brace - backward_brace
            # Boolean check on flag performed for writing to the output file.
            # If the 'End ' block is arrived at then the out_flag is set to False.
            if out_flag:
                out_file.write(line)
                if 'End ' in line:
                    out_flag = False
            # Once the end of the scope block is arrived at ie., brace count is 0 and
            # also flag is set to False, the tracking variables are reset for next cell.
            if scope_count == 0 and not (out_flag):
                forward_brace = backward_brace = 0
                out_flag = False
    
    out_file.close()
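One possible way to make the code above shorter and less memory-hungry is to stream file1 through a generator that yields only the lines of matching blocks. Note that in the posted code every comment line is written and then skipped via `continue`, so the 'End' line never reaches the `'End ' in line` check; the sketch below avoids that by ending a block when the braces balance. It is not a drop-in replacement: it assumes that a block starts at a 'Begin' line containing one of the patterns, that braces are balanced within a block, and that comment lines start with `/*` or `*`. The names `extract_blocks`, `main` and `COMMENT_LINE` are hypothetical.

```python
import re

# Assumption: a comment line begins with '/*' or '*' after optional whitespace.
COMMENT_LINE = re.compile(r"\s*(?:/\*|\*)")


def extract_blocks(lines, patterns):
    """Yield the lines of every block whose 'Begin' line contains a pattern."""
    in_block = False   # currently inside a matching block
    opened = False     # at least one '{' seen since the 'Begin' line
    depth = 0          # current brace nesting depth
    for line in lines:
        is_comment = COMMENT_LINE.match(line) is not None
        if in_block and opened and depth == 0:
            # Braces are balanced again, so the block body is complete.
            in_block = False
            if is_comment and "End " in line:
                yield line  # an optional 'End' marker still belongs to it
                continue
        if not in_block:
            if "Begin " in line and any(p in line for p in patterns):
                in_block, opened, depth = True, False, 0
                yield line
            continue
        yield line
        if is_comment:
            continue  # braces inside comment lines are not counted
        opens = line.count("{")
        depth += opens - line.count("}")
        opened = opened or opens > 0


def main(in_path, pattern_path, out_path="file3"):
    """Read patterns from pattern_path and copy matching blocks to out_path."""
    with open(pattern_path) as pattern_file:
        patterns = {p.rstrip("\n") for p in pattern_file if p.strip()}
    # Both files are closed automatically; file1 is never held in memory.
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(extract_blocks(src, patterns))
```

With this layout the script body could reduce to `main(sys.argv[1], sys.argv[2])`, and `extract_blocks` can be tested on a plain list of strings without touching the file system.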
    

0 answers:

No answers yet.