How to convert fixed-width multi-line records to single-line records using awk

Time: 2015-03-27 08:09:33

Tags: awk

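I want to convert multi-line records in a fixed-width file into single-line records. The file contains 4 fields: date stamp, severity, error code, and message type. A record can span multiple lines, depending on the data in its fields. For example, the date stamp field is 10 characters wide, but the data value is 19 characters long, so it is spread across two lines: the first 10 characters are on the first line and the remaining 9 are on the second line.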

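Field positions

date stamp = 1-10
severity = 12-17 [the value may be error, info, or warning, so if the value is warning, the remaining data is placed on the second line in 12-17]
error_code = 18-25
message = 26-70

There are no blank lines between records.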

2014-02-21 INFO UTF8_INT  Starting execution of workflow
07:01:59                  [wf_router] in domain.

2014-02-21 error UTF8_INT  SQ_ff:Exchange: Rowdata: ( RowType=0
07:01:59                 (insert) Src Rowid=1 TargIELD:Char.500:):
                          ".Improved By Resting
                         [[<~a~>Resting<~a0~>]]|Lying Down
                         [[<FNT><!>no Lying Down]]).

2014-02-21 warni UTF8_INT  SQ_ff:Exchange: Rowdata: ( RowType=0
           ng              (insert) Src Rowid=1 TargIELD:Char.500:):
                          ".Improved By Resting
                         [[<~a~>Resting<~a0~>]]|Lying Down
                         [[<FNT><!>no Lying Down]]).

http://i.stack.imgur.com/EAHSR.png

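2 Answers:

Answer 0: (score: 0)

Although awk is designed around field separators (whitespace by default), it can also read fixed-width files. To retrieve a field of width w starting at column p (where 1 is the leftmost position in a line), use substr($0, p, w). To accumulate column data across lines, simply use one variable per column.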

{
    if (/[^ \t]/) {
        datetime = datetime " " trim(substr($0, 1, 10));
        severity = severity substr($0, 12, 5);
        errorcode = errorcode substr($0, 18, 8);
        message = message " " trim(substr($0, 26));
    }
    else {
        output();
        datetime = severity = errorcode = message = "";
    }
}

END {
    output();
}

function output() {
    if (datetime || severity || errorcode || message) {
        print trim(datetime) " ; " trim(severity) " ; " trim(errorcode) " ; " trim(message);
    }
}

function trim(s) {
    gsub(/^[ \t]+|[ \t]+$/, "", s);
    return s;
}

Input (note that I fixed the alignment of UTF8_INT on line 1):

2014-02-21 INFO  UTF8_INT Starting execution of workflow
07:01:59                  [wf_router] in domain.

2014-02-21 error UTF8_INT  SQ_ff:Exchange: Rowdata: ( RowType=0
07:01:59                 (insert) Src Rowid=1 TargIELD:Char.500:):
                          ".Improved By Resting
                         [[<~a~>Resting<~a0~>]]|Lying Down
                         [[<FNT><!>no Lying Down]]).

2014-02-21 warni UTF8_INT  SQ_ff:Exchange: Rowdata: ( RowType=0
           ng              (insert) Src Rowid=1 TargIELD:Char.500:):
                          ".Improved By Resting
                         [[<~a~>Resting<~a0~>]]|Lying Down
                         [[<FNT><!>no Lying Down]]).

Output:

2014-02-21 07:01:59 ; INFO ; UTF8_INT ; Starting execution of workflow [wf_router] in domain.
2014-02-21 07:01:59 ; error ; UTF8_INT ; SQ_ff:Exchange: Rowdata: ( RowType=0 (insert) Src Rowid=1 TargIELD:Char.500:): ".Improved By Resting [[<~a~>Resting<~a0~>]]|Lying Down [[<FNT><!>no Lying Down]]).
2014-02-21 ; warning ; UTF8_INT ; SQ_ff:Exchange: Rowdata: ( RowType=0 (insert) Src Rowid=1 TargIELD:Char.500:): ".Improved By Resting [[<~a~>Resting<~a0~>]]|Lying Down [[<FNT><!>no Lying Down]]).

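Notes:

  • Since you did not answer all of my questions, I cannot tell whether this script meets all of your requirements.
  • Assuming you need a semicolon as the field separator in the output data, I wonder how you want to handle semicolons that already exist in the input data. Should I apply some kind of escaping?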
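
One possible answer to the escaping question, offered as a sketch and not as part of the original answer: CSV-style quoting, where any field that contains a semicolon or a double quote is wrapped in double quotes and embedded quotes are doubled. The names quote_fields and quote are hypothetical, and the tab-separated input is only an assumption that makes the demonstration self-contained.

```shell
# Hypothetical helper: read tab-separated fields and join them with ";",
# quoting any field that already contains ";" or a double quote.
quote_fields() {
    awk -F '\t' '
    # Wrap s in double quotes if it contains ; or ", doubling embedded quotes.
    function quote(s) {
        if (s ~ /[;"]/) {
            gsub(/"/, "\"\"", s)   # double any embedded quotes
            s = "\"" s "\""
        }
        return s
    }
    {
        out = ""
        for (i = 1; i <= NF; i++)
            out = out (i > 1 ? ";" : "") quote($i)
        print out
    }'
}
```

In the script above, the same quote() function could be wrapped around each trim(...) call inside output(), assuming the consumer of the output understands this quoting convention.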

Answer 1: (score: 0)

Something like this is what you need (using GNU awk for various extensions):

$ cat tst.awk
BEGIN { FIELDWIDTHS="10 1 5 1 8 45" }

/^[0-9]{4}(-[0-9]{2}){2}/ && (NR>1) { prtrec() }

{
    for (i=1;i<=NF;i++) {
        rec[i] = rec[i] $i
    }
}

END { prtrec() }

function prtrec() {
    n=split("1 3 5 6",f)
    for (i=1;i<=n;i++) {
        gsub(/^\s+|\s+$/,"",rec[f[i]])
        printf "%s%s", rec[f[i]], (i<n?OFS:ORS)
    }
    delete rec
}

$ gawk -f tst.awk file
2014-02-2107:01:59 INFO UTF8_INT Starting execution of workflow [wf_router] in domain.
2014-02-2107:01:59 error UTF8_INT SQ_ff:Exchange: Rowdata: ( RowType=0(insert) Src Rowid=1 TargIELD:Char.500:): ".Improved By Resting[[<~a~>Resting<~a0~>]]|Lying Down[[<FNT><!>no Lying Down]]).
2014-02-21 warning UTF8_INT SQ_ff:Exchange: Rowdata: ( RowType=0  (insert) Src Rowid=1 TargIELD:Char.500:): ".Improved By Resting[[<~a~>Resting<~a0~>]]|Lying Down[[<FNT><!>no Lying Down]]).

Just guessing at your desired output, since you did not post it in your question.
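
If GNU awk is not available, the same idea can be sketched with POSIX awk features only: substr() instead of FIELDWIDTHS, and a date at the start of a line as the record boundary, so no blank lines between records are needed. This is a variant under stated assumptions, not the answer's own code: the names join_records, add, and emit are hypothetical, the column positions are the ones used in the scripts above, and joining date and message fragments with a space (but severity and error code fragments without one) is a guess at the desired output.

```shell
# Hypothetical helper: merge fixed-width multi-line records read from stdin.
join_records() {
    awk '
    # Append trimmed fragment t to s, separated by sep; skip empty fragments.
    function add(s, t, sep) {
        gsub(/^[ \t]+|[ \t]+$/, "", t)
        if (t == "") return s
        return (s == "" ? t : s sep t)
    }
    # Print the accumulated record, if any, and reset the accumulators.
    function emit() {
        if (have) print stamp, sev, code, msg
        stamp = sev = code = msg = ""
        have = 0
    }
    # A line that starts with YYYY-MM-DD begins a new record.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ { emit() }
    # Every non-blank line contributes a fragment to each column.
    /[^ \t]/ {
        stamp = add(stamp, substr($0, 1, 10), " ")
        sev   = add(sev,   substr($0, 12, 5), "")
        code  = add(code,  substr($0, 18, 8), "")
        msg   = add(msg,   substr($0, 26),    " ")
        have = 1
    }
    END { emit() }
    '
}
```

It would be run as join_records < file. Fields are joined with the default OFS (a single space), as in the answer above.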