AWK script to process information spread across multiple lines of a file

Time: 2019-03-27 10:33:35

Tags: awk

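I am trying to write a text-processing script for what seems to be a fairly simple task. I have a file that contains the following repeating pattern: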

111 0 1000 other stuff        #<- here a new element begins
      some text &             #<- "&" or white spaces increment - 
      some more               #<- signal continue on next line
      last line 
221 1 1.22E22                 # new element $2!=0 must be followed by float
   text &
   contiuned text
c comment line in between 
   more text &
last line
2221 88 -12.123 &
line1 
   line2
c comment line 
last line
223 0 lll -111        $ element given by line 
22 22 -3.14           $ element given by new line

I want to get:

111 0 1000 other stuff        #<- here a new element begins
      some text &             #<- "&" or white spaces increment - 
      some more               #<- signal continue on next line
      last line &
             xyz=1 
221 1 1.22E22                 # new element $2!=0 must be followed by float
   text &
   contiuned text
c comment line in between 
   more text &
last line &
      xyz=1
2221 88 -12.123 &
line1 
   line2
c comment line 
last line &
      xyz=1 
223 0 lll -111 &     $ element given by line
      xyz=1 
22 22 -3.14 &          $ element given by new line
      xyz=1

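I would like to develop an awk script that appends a string to the last line of each element. To do this, my script looks for the new-element pattern and keeps reading until it finds one of the next element indicators. Unfortunately, it does not work properly: it prints the last line twice, and it fails to append to the last line of the file.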

function newelement(line) {
  split(line, s, " ")
  if (s[1] ~/^[0-9]+$/ && ((s[2] ~/^[0-9]+$/ && s[3] ~/\./) || (s[2] == 0 && s[3] !~/\./))) {
    return 1
  } else {
    return -1
  }
}

function contline(line) {
  if (line~/&/ || line~/^[cC]/ || line~/^\s{3,10}[^\s]./) {
    return 1
  } else {
    return -1
  }
}

BEGIN {
  subs = " xyz=1 "
} #increment to have the next line in store
FNR == 1 {
  getline nextline < FILENAME
} 
{ 
  # get the next line
  getline nextline < FILENAME
  if (newelement($0) == 1 && NR < 3673) {
    if (length(old) > 0 || $0~/^$/) {
      printf("%s &\n%20s\n", old, subs)
      print $0
    } 
    # to capture one line elements with no following continuation
    # i.e.
    # 221 91 0.5 33333
    # 22  0  11
    #look at the next line
    else if (($0!~/&/ && contline(nextline) == -1)) {
      printf("%s &\n%20s\n", $0, subs)
    }
  } 
  else {
  print "-" $0
  }
  # store last not - commented line
  if ($0!~/^\s{0,20}[cC]/) old = $0

}

Comment lines start with c or C followed by a blank. Comment lines should be preserved, but no string should be appended after them.

2 answers:

Answer 0 (score: 2)

Please check the following code and let me know if it works for you:

$ cat 3.1.awk
BEGIN{
    subs      = " xyz=1 "
    threshold = 3673
}

# return boolean if the current line is a new element
function is_new_element(){
    return ($1~/^[0-9]+$/) && (($2 ~ /^[0-9]+$/ && $3~/\./) || ($2 == 0 && $3 !~/\./))
}

# return boolean if the current line is a comment or empty line
function is_comment() {
    return /^\s*[cC] / || /^\s*$/
}

# function to append extra text to line
# and followed by comments if applicable
function get_extra_text(     extra_text) {
    extra_text = sprintf("%s &\n%20s", prev, subs)
    text = (text ? text ORS : "") extra_text
    if (prev_is_comment) {
        text = text ORS comment
        prev_is_comment = 0
        comment = ""
    }
    return text
}

NR < threshold {
# replace the above line with the following one if 
# you want to process up to the first EMPTY line
#NR==1,/^\s*$/ {
    # if the current line is a new element
    if (is_new_element()) {
        # save the last line and preceding comments
        # into the variable 'text', skip the first new element
        if (has_hit_first_new_element) text = get_extra_text()
        has_hit_first_new_element = 1
        prev_is_new = 1
    # before hitting the first new_element line, all lines 
    # should be printed as-is
    } else if (!has_hit_first_new_element) {
        print
        next
    # if current line is a comment
    } else if (is_comment()) {
        comment = (comment ? comment ORS : "") $0
        prev_is_comment = 1
        next
    # if the current line is neither new nor comment
    } else {
        # if previous line a new element
        if (prev_is_new) {
            print (text ? text ORS : "") prev
            text = ""
        # if previous line is comment
        } else if (prev_is_comment) {
            print prev ORS comment
            prev_is_comment = 0
            comment = ""
        } else {
            print prev
        }
        prev_is_new = 0
    }
    # prev saves the last non-comment line
    prev = $0
    next
}
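# print the last block if NR >= threshold
!is_last_block_printed {
    print get_extra_text()
    is_last_block_printed = 1
}

# if the input ends before the threshold is reached, the rule above
# never fires, so flush the last block here instead
END {
    if (!is_last_block_printed && has_hit_first_new_element) print get_extra_text()
}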

# print lines when NR > threshold or after the first EMPTY line
{   print "-" $0 }

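Where:

The lines are divided into 3 categories and handled differently:

  1. is_new_element() is true when the current line is a new element; the flag prev_is_new records that the previous line was a new element
  2. is_comment() is true when the current line is a comment (or an empty line); the flag prev_is_comment records that a comment line came before
  3. Other lines: every line that falls into neither of the two categories above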
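The three categories above can be sketched as a small classifier that only labels each line and prints nothing else. This is a minimal illustration, not part of the answer's script: the wrapper name classify_lines and the new/comment/other labels are made up for the example, and [ \t] is used instead of \s because \s is a GNU awk extension.

```shell
# classify_lines: hypothetical helper that prefixes each input line with its category
classify_lines() {
    awk '
    # category 1: new element (same test as is_new_element() in the answer)
    function is_new_element() {
        return ($1 ~ /^[0-9]+$/) && (($2 ~ /^[0-9]+$/ && $3 ~ /\./) || ($2 == 0 && $3 !~ /\./))
    }
    # category 2: comment line ("c" or "C" followed by a blank) or empty line
    function is_comment() {
        return ($0 ~ /^[ \t]*[cC] /) || ($0 ~ /^[ \t]*$/)
    }
    {
        if (is_new_element())  kind = "new"
        else if (is_comment()) kind = "comment"
        else                   kind = "other"      # category 3: everything else
        print kind ": " $0
    }'
}

# label a few lines taken from the sample input
printf '%s\n' '111 0 1000 other stuff' '      some text &' 'c comment line' '221 1 1.22E22' | classify_lines
```

Running the classifier over the real input before running the full script is a cheap way to confirm that every line lands in the category you expect.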

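Other notes:

  • You can use either NR < threshold (3673 in the code) or the range pattern NR==1,/^\s*$/ to process only a certain range of lines.
  • The is_last_block_printed flag and its related code ensure that the last processed block is printed either at the end of the range above or in the END{} block.
  • I did not check the trailing & on continuation lines. If such a line is followed by a comment or a new element, you have to define the logic, i.e. which one takes precedence.
  • The code would not work properly if there were other lines before the first is_new_element() line. This is handled by adding another flag (has_hit_first_new_element in the code above) rather than using if (NR > 1) to update text.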
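To make the first note concrete, here is a minimal sketch of the two ways to bound the processed region. The wrapper names limit_by_count and limit_by_blank and the "+" marker are assumptions for illustration. Only the "-" prefix for unprocessed lines mirrors the answer's script.

```shell
# limit_by_count: treat lines with NR < threshold as "processed" (+), the rest as "-"
limit_by_count() {
    awk -v threshold="$1" 'NR < threshold { print "+" $0; next } { print "-" $0 }'
}

# limit_by_blank: treat line 1 through the first blank line as "processed" (+);
# [ \t] stands in for the GNU-only \s used in the answer
limit_by_blank() {
    awk 'NR == 1, /^[ \t]*$/ { print "+" $0; next } { print "-" $0 }'
}

printf '%s\n' a b c d | limit_by_count 3
printf '%s\n' a b '' c | limit_by_blank
```

Note that a range pattern includes the line that matches its end condition, so the blank line itself is still handled by the first rule.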

Test sample:

$ cat 3.1.txt
111 0 1000 other stuff        #<- here a new element begins
      some text &             #<- "&" or white spaces increment -
      some more               #<- signal continue on next line
      last line
221 1 1.22E22                 # new element $2!=0 must be followed by float
    text &
   contiuned text
c comment line in between
   more text &
last line
2221 88 -12.123 &
line1
   line2
c comment line 1
last line
c comment line 2
c comment line 3
c comment line 4
c comment line 5
223 0 lll -111        
223 0 22 -111        
223 0 22 -111        
c comment line in between 1
c comment line in between 2
22 22 -3.14         
c comment line at the end

Output:

$ awk -f 3.1.awk 3.1.txt
111 0 1000 other stuff        #<- here a new element begins
      some text &             #<- "&" or white spaces increment - 
      some more               #<- signal continue on next line
      last line  &
              xyz=1 
221 1 1.22E22                 # new element $2!=0 must be followed by float
   text &
   contiuned text
c comment line in between 
   more text &
last line &
              xyz=1 
2221 88 -12.123 &
line1 
   line2
c comment line 1
last line &
              xyz=1 
c comment line 2
c comment line 3
c comment line 4
c comment line 5
223 0 lll -111  &
              xyz=1 
223 0 22 -111  &
              xyz=1 
223 0 22 -111  &
              xyz=1 
c comment line in between 1
c comment line in between 2
22 22 -3.14    &
              xyz=1 
c comment line at the end

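Some additional notes:

  • When appending subs to the prev line, the trailing newline "\n" is the tricky part of the text processing. This matters especially when new_element lines occur consecutively.

  • It is important that the variable prev in the code is defined as the previous non-comment line (categories 1 and 3 defined above). There may be zero or more comment lines (category 2) between the prev line and the current line. That is why regular comments (as opposed to comments right before a new_element line) are printed with print prev ORS comment rather than print comment ORS prev.

  • A group of comment lines (one or more consecutive comment lines) is saved in the variable comment. If the group sits right before a new_element line, it is appended to the variable text. All other comment groups are printed by the print prev ORS comment statement mentioned above.

  • The function get_extra_text() builds the extra_text in the order prev subs ORS comments, where comments is included only when the flag prev_is_comment is 1. Note that the same variable text may already hold several prev subs ORS comments blocks if there are consecutive new_element lines.

  • We only print on category-3 lines (neither new_element nor comment). This is a safe place where we do not have to worry about the trailing newline or extra_text:

    • If prev_is_new, we print the cached text, followed by the variable prev (which is a new_element).
    • If prev_is_comment, we just print prev ORS comment. Note again that the variable prev holds the last non-comment line before the current line; it need not be the line directly above the current line.
    • In all other cases, just print prev as-is.
  • Since we concatenate lines into the variables text and comment, we use the following syntax to avoid a leading ORS (which is "\n" by default):

    text = (text ? text ORS : "") prev

    If a leading ORS is not a concern, you can simply use:

    text = text ORS prev

    Also, since lines are appended to these variables, they must be reset (i.e. text = "") each time after they are used; otherwise the concatenated variables would contain all previously processed lines.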
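The concatenate-then-reset idiom from the last note can be tried in isolation. The sketch below is an assumption-laden toy, not the answer's code: the wrapper name join_blocks, the bracket markers, and the use of blank lines as flush points are all made up for the example. It shows that the buffer gets no leading ORS and that resetting it keeps earlier blocks from leaking into later ones.

```shell
# join_blocks: hypothetical helper that buffers lines and flushes them at each blank line
join_blocks() {
    awk '
    /^$/ {
        if (text != "") print "[" text "]"   # flush the buffered block
        text = ""                            # reset, or the next block would repeat this one
        next
    }
    { text = (text ? text ORS : "") $0 }     # append without a leading ORS
    END { if (text != "") print "[" text "]" }
    '
}

printf '%s\n' a b '' c | join_blocks
```

If the reset line is removed, the second flush prints the first block again in front of the new one, which is the failure mode the note warns about.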

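Final notes

  1. Added a flag has_hit_first_new_element so that any lines before the first new_element line are printed as-is. In this code the first new_element line has to be treated differently, and relying on NR == 1 is not a safe guard.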
  2. Removed some redundant code.

Answer 1 (score: 1)

Try this:

function newelement(line){
    split(line,s," ")
    if(s[1]~/^[0-9]+$/ && ((s[2]~/^[0-9]+$/ && s[3]~/\./)|| (s[2]==0 && s[3]!~/\./))){return 1}
    else{return -1}
}

BEGIN{
    subs=" xyz=1 "
} 
{
    if (length($0)==0) next   # Skip empty lines, remove or change it according to your needs.
    if (newelement($0)==1){
        if (length(last_data)>0) {
            printf("%s &\n%20s\n",last_data,subs)
            if (last_type=="c") {
                print comments
            }
        }
        last_data=$0
        last_type="i"
    } else if($0 ~/^\s*[cC] /) {
        if (last_type=="c") comments = comments ORS $0
        else comments = $0
        last_type="c"
    } else {
        # print the pending data line first, then any comment lines that followed it
        if (length(last_data)>0) print last_data
        if (last_type=="c") print comments
        last_data=$0
        last_type="d"
    }
}
END{
    printf("%s &\n%20s\n",last_data,subs)
    if (last_type=="c") print comments
}

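Three variables:

  • last_data keeps the last data line.
  • last_type keeps the type of the last line: i for an indicator (new element) line, c for a comment, and d for any other data line.
  • comments keeps the pending comment lines.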
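The core idea behind last_data is to hold each line back by one step, so that the script knows what follows before deciding how to print it. A stripped-down sketch of that idea follows. It assumes the input starts with an element line and contains no comment lines. The wrapper name append_to_last and the variable held are illustrative and not from the answer.

```shell
# append_to_last: hypothetical, simplified version of the hold-back technique
append_to_last() {
    awk -v subs=" xyz=1 " '
    function is_new_element() {
        return ($1 ~ /^[0-9]+$/) && (($2 ~ /^[0-9]+$/ && $3 ~ /\./) || ($2 == 0 && $3 !~ /\./))
    }
    NR > 1 {
        # the held line ends an element only if the current line starts a new one
        if (is_new_element()) printf("%s &\n%20s\n", held, subs)
        else                  print held
    }
    { held = $0 }                                       # hold the current line back
    END { if (NR > 0) printf("%s &\n%20s\n", held, subs) }  # the final line always ends an element
    '
}

printf '%s\n' '111 0 1000' '   last line' '22 22 -3.14' | append_to_last
```

The END rule is what handles the last line of the file, which is the case the script in the question could not append to.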