Using awk

Time: 2018-12-04 00:06:05

Tags: awk

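I have a data file in which each field is on a separate line, as shown below. The specific fields that appear in a record vary, so I can't use any solution that strings fields together without knowing what they are.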

Sample input

Creator=Burroughs Wellcome and Company
Date=ca. 1906
Description=Blue cardboard box, measuring 5.5 cm x 4.3 cm x 2.2 cm. Box in fair condition.
Identifier=77-97.1.3a
DOI=doi:10.6083/M4H41PRC
Medium=Cardboard
Relation=References 77-97.1.3b.jpg
Rights=COPYRIGHT NOT EVALUATED 
Source=Medical Museum Collection, Box 1
Subject=Vaporole;;;Epinine;;;Deoxyepinephrine;;;Pharmaceutical Preparations
Title=Box containing medicine vials
Type=Still Image
collection=2
filename=df0968b22c1072c8909538c516dc81b6.jpg
id=10959

Date=ca. 1906
Description=Two stemmed amber glass vials in a blue cardboard box. 
Identifier=77-97.1.3b
DOI=doi:10.6083/M4CC0Z0M
Medium=Glass;;;Cardboard
Relation=IsPartOf 77-97.1.3a.jpg
Rights=COPYRIGHT NOT EVALUATED
Source=Medical Museum Collection, Box 1
Subject=Vials;;;Vaporole;;;Epinine;;;Deoxyepinephrine;;;Pharmaceutical Preparations
Title=Medicine vials in a box
Type=Still Image
collection=2
filename=9e846a60d8a79de37e91279696e520e6.jpg
id=10960

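I need to convert this to a delimited file. Because any given field may or may not be present in a record, I need to enumerate the columns across the records, e.g. Title, Creator, Date, Identifier, etc.

Is there a slick way to do this in awk, or do I need to bite the bullet and write a program?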

1 answer:

Answer 0 (score: 0)

You didn't provide sample output, so this is a guess, but it may be what you want:

$ cat tst.awk
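BEGIN {
    RS   = ""               # paragraph mode: blank lines separate records
    FS   = "\n"             # each line of a record is one field
    OFS  = ","
    ofmt = "\"%s\"%s"       # a quoted cell followed by a separator or newline
}
# First pass over the file: collect field names in order of first appearance.
NR == FNR {
    for (i=1; i<=NF; i++) {
        name = $i
        sub(/=.*/,"",name)
        if ( !seen[name]++ ) {
            nr2name[++numNames] = name
        }
    }
    next
}
# Second pass, first record: print the header row.
FNR == 1 {
    for (nameNr=1; nameNr<=numNames; nameNr++) {
        name = nr2name[nameNr]
        printf ofmt, name, (nameNr<numNames ? OFS : ORS)
    }
}
# Second pass, every record: map names to values, then print one row
# with an empty cell for each field the record lacks.
{
    delete name2val
    for (fldNr=1; fldNr<=NF; fldNr++) {
        name = val = $fldNr
        sub(/=.*/,"",name)          # keep the text before the first "="
        sub(/[^=]+=/,"",val)        # keep the text after the first "="
        name2val[name] = val
    }

    for (nameNr=1; nameNr<=numNames; nameNr++) {
        name = nr2name[nameNr]
        val  = name2val[name]
        printf ofmt, val, (nameNr<numNames ? OFS : ORS)
    }
}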
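
The `NR == FNR` test suggests the script is meant to be run with the same input file named twice, for example `awk -f tst.awk file file`, with the first pass collecting column names and the second printing rows. That invocation is not shown here, so treat it as an assumption. When the data arrives on a pipe and cannot be read twice, a single-pass variant that buffers every record in memory may be an option. The following is only a sketch of that idea and not part of the original answer: the function name `records_to_csv` and the arrays `seen`, `names` and `cell` are hypothetical, it assumes the whole input fits in memory, and it doubles any embedded double quotes (a common CSV convention that the original script does not apply).

```shell
# Hypothetical helper, not from the original answer.
# Reads blank-line-separated name=value records from files or stdin
# and prints quoted CSV with one column per distinct field name.
records_to_csv() {
    awk '
    BEGIN { RS = ""; FS = "\n" }    # paragraph mode, one line per field
    {
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")     # position of the first "="
            if (eq == 0) continue   # skip lines that have no "="
            name = substr($i, 1, eq - 1)
            val  = substr($i, eq + 1)
            if (!(name in seen)) {  # remember column order on first sight
                seen[name] = 1
                names[++n] = name
            }
            gsub(/"/, "\"\"", val)  # double embedded quotes (assumed convention)
            cell[NR, name] = val    # buffer the value until END
        }
    }
    END {
        # Header row, then one row per record; missing fields print as "".
        for (j = 1; j <= n; j++)
            printf "\"%s\"%s", names[j], (j < n ? "," : "\n")
        for (r = 1; r <= NR; r++)
            for (j = 1; j <= n; j++)
                printf "\"%s\"%s", cell[r, names[j]], (j < n ? "," : "\n")
    }' "$@"
}
```

It could be called as `records_to_csv file` or as `some_command | records_to_csv`. The trade-off against the two-pass script is memory: every value is held until the end of input.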