awk with multiline regex; output file name based on awk match

Date: 2015-01-01 04:43:17

Tags: regex awk

I am currently trying to extract 300-odd functions and subroutines from a 22 kLoC file, and decided to try doing it programmatically (I have already pulled out the "biggest" chunks by hand).

Consider a file of the form

declare sub DoStatsTab12( byval shortlga as string)
declare sub DoStatsTab13( byval shortlga as string)
declare sub ZOMFGAnotherSub

Other lines that start with something other than "/^sub \w+/" or "/^end sub/"

sub main

    This is the first sub: it should be in the output file mainFunc.txt

end sub

sub test

    This is a second sub

    it has more lines than the first.

    It is supposed to go to testFunc.txt

end sub

Function ConvertFileName(ByVal sTheName As String) As String

    This is a function so I should not see it if I am awking subs

    But when I alter the awk to chunk out functions, it will go to ConvertFileNameFunc.txt    

End Function

sub InitialiseVars(a, b, c)

    This sub has some arguments - next step is to parse out its arguments
    Code code code;
    more code;
    ' maybe a comment, even? 


  and some code which is badly indented (original code was written by a guy who didn't believe in structure or documentation)

    and


  with an arbitrary number of newlines between bits of code because why not? 


    So anyhow - the output of awk should be everything from sub InitialiseVars to end sub, and should go into InitialiseVarsFunc.txt

end sub

The gist: find the sets of lines that start with ^sub [subName](subArgs) and end with ^end sub

then (and here is where I am stuck): save the extracted subroutine to a file named [subName]Func.txt

awk suggested itself as a candidate (I have written text-extraction regex queries in PHP with preg_match() in the past, but I don't want to count on having a WAMP/LAMP stack available).

My starting point is delightfully minimalist (double quotes because Windows):

awk "/^sub/,/^end sub/" fName

This finds the relevant blocks (and prints them to stdout).
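On a file like the one above, the range pattern by itself behaves as follows (a minimal sketch; demo.bas is a made-up file name, and single quotes are used here for a Unix shell rather than the double quotes needed on Windows):

```shell
# Build a tiny sample in the same shape as the file above (demo.bas is hypothetical)
cat > demo.bas <<'EOF'
declare sub main
sub main
    body of main
end sub
other text
sub test
    body of test
end sub
EOF

# /^sub/,/^end sub/ prints every line from a line starting with "sub"
# through the next line starting with "end sub", inclusive
awk '/^sub/,/^end sub/' demo.bas
```

Note that "declare sub main" is not picked up, because /^sub/ is anchored to the start of the line.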

The step of getting that output into files, and naming each file after awk's $2 capture, is beyond me.

An earlier stage of this process involved awking out the subroutine names and storing them: that was easy, because every subroutine is declared by a one-liner of the form

declare sub [subName](subArgs)

So here it is, and it works perfectly -

awk "match($0, /declare sub (\w+)/)
{print substr($3, RSTART, index($3, \"(\")>0 ? index($3, \"(\")-1: RLENGTH)
     > substr($3, RSTART, index($3, \"(\")>0 ? index($3, \"(\")-1: RLENGTH)\".txt\"}"
fName

(I have tried to lay it out so that it is easy to see that the output file name and the awked-out $3 - parsed up to the first '(' if there is one - are the same thing.)
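The same name extraction can be done without the RSTART/RLENGTH bookkeeping by deleting everything from the first '(' onward in $3 (a POSIX-awk sketch, not the question's original code; decls.bas is a made-up file name):

```shell
cat > decls.bas <<'EOF'
declare sub DoStatsTab12( byval shortlga as string)
declare sub ZOMFGAnotherSub
EOF

# On each "declare sub" line, $3 is the name, possibly with a '(' glued on;
# sub(/\(.*/, "", $3) removes the paren and everything after it
awk '/^declare sub / { sub(/\(.*/, "", $3); print $3 }' decls.bas
# prints DoStatsTab12 and ZOMFGAnotherSub, one per line
```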

It seemed to me that if the output of

awk '/^sub/,/^end sub/' fName

were concatenated into an array, then $2 (suitably truncated at the '(') would do the trick. But it didn't.

I have looked at the various SO (and other SE-network) threads that deal with multiline awk - e.g. this one and this one - but none of them gave me enough of a head start on my problem (they help with getting the match itself, but not with piping it to a file named after itself).

I have RTFD'd for awk (and grep), also to no avail.

2 Answers:

Answer 0 (score: 4):

I suggest

awk -F '[ (]*' '            # Field separator is space or open paren (for
                            # parameter lists). * because there may be multiple
                            # spaces, and parens only appear after the stuff we
                            # want to extract.
  BEGIN { IGNORECASE = 1 }  # case-insensitive pattern matching is probably
                            # a good idea because Basic is case-insensitive.
  /^sub/ {                  # if the current line begins with "sub"
    outfile = $2 "Func.bas" # set the output file name
    flag = 1                # and the flag to know that output should happen
  }
  flag == 1 {               # if the flag is set
    print > outfile         # print the line to the outfile
  }
  /^end sub/ {              # when the sub ends, 
    flag = 0                # unset the flag
  }
' foo.bas
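A quick way to check the flag technique on a cut-down sample (a sketch; this foo.bas content is made up, and the variant below sticks to POSIX awk, so it drops the gawk-only IGNORECASE and uses a single-character field-separator class):

```shell
cat > foo.bas <<'EOF'
sub main
    line one
end sub
sub test
    line two
end sub
EOF

# Same flag technique: on /^sub/ set the output file from $2 and raise the
# flag; print while the flag is set; drop the flag on /^end sub/
awk -F'[ (]' '
  /^sub/     { outfile = $2 "Func.bas"; flag = 1 }
  flag == 1  { print > outfile }
  /^end sub/ { flag = 0 }
' foo.bas
```

After this run, mainFunc.bas holds the first block and testFunc.bas the second.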

Note that parsing source code with simple pattern-matching tools is error-prone, because programming languages are usually not regular languages (with a few exceptions such as Brainfuck). This sort of thing always depends on how the code is formatted.

For example, if somewhere in the code a sub declaration is split across two lines (which I believe can be done with _, although Basic is not something I work with every day), trying to extract the sub name from the first line of its definition is futile. Formatting can also require fine-tuning of the patterns; things like extra spaces at the beginning of a line would have to be handled. Use this strictly for a one-off code conversion, verify that it produced the results you wanted, and don't be tempted to make it part of a regular workflow.

Answer 1 (score: 1):

Another way

awk -F'[ (]' 'x+=(/^sub/&&file=$2"Func.txt"){print > file}/^end sub/{x=file=""}' file

解释

awk -F'[ (]'                   - Set field separator to space or open paren

x+=(/^sub/&&file=$2"Func.txt") - Sets x to 1 if line begins with sub and sets file 
                                 to the second field + func.txt. As this is a 
                                 condition that is checking if x is true then the 
                                 next block will repeatedly be executed until x 
                                 is unset.

{print > file}                 - Whilst x is true print the line into the set filename


/^end sub/{x=file=""}          - If line begins with end sub then set both x and file 
                                 to nothing.
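The one-liner can be exercised the same way (a sketch; sample.bas is made up, and an extra pair of parentheses is added around the assignment, since some awks are strict about the precedence of && versus =):

```shell
cat > sample.bas <<'EOF'
sub alpha
    body line
end sub
EOF

# x becomes non-zero on a /^sub/ line (and file gets its name from $2);
# while x is non-zero every line, including "end sub", is printed to file;
# the final rule then resets both x and file
awk -F'[ (]' '(x+=(/^sub/ && (file=$2"Func.txt"))){print > file} /^end sub/{x=file=""}' sample.bas
```

The block above should leave the whole sub, delimiters included, in alphaFunc.txt.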