awk: compare two files and print formatted output

Time: 2017-09-16 13:32:19

Tags: unix awk

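I want to compare two files based on the first field ($1) of each file.

Matched rows (present in both Aug.csv and Sep.csv): print the row from each file, followed by "Matched" as the last field, Remarks.

Non-matched rows from Aug.csv (present in Aug.csv but not in Sep.csv): print the Aug.csv row, then the placeholder "NOT" repeated 5 times, once per field (NF) of Sep.csv, i.e. "NOT,NOT,NOT,NOT,NOT", followed by "Not in Sep.csv" (or the FILENAME) as the last field, Remarks.

Non-matched rows from Sep.csv (present in Sep.csv but not in Aug.csv): print the placeholder "NOT" repeated 4 times, once per field (NF) of Aug.csv, i.e. "NOT,NOT,NOT,NOT", then the Sep.csv row, followed by "Not in Aug.csv" (or the FILENAME) as the last field, Remarks.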

Aug.csv

Name,Age,Place,Des
aaa,40,xxx,Aug
aaa,20,yyy,Aug
ccc,35,xxx,Aug

Sep.csv

Name,Age,Place,Edu,Des
aaa,50,zzz,eee,Sep
bbb,30,xxx,yyy,Sep
aaa,60,yyy,fff,Sep
bbb,50,yyy,fff,Sep

Expected Output.csv

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv

I tried the two commands below to get the desired output, but without success.

First command:

 awk -v first="NOT,NOT,NOT,NOT"  -v second="NOT,NOT,NOT,NOT,NOT" -F"," 'NR==FNR{a[$1]=$0;next}{if (a[$1])print a[$1],$0,"Matched";else print first, $0,"Not in Aug.csv";}' OFS="," Aug.csv Sep.csv >Output.csv

Second command:

awk -v first="NOT,NOT,NOT,NOT"  -v second="NOT,NOT,NOT,NOT,NOT" -F"," 'NR==FNR{a[$1]=$0;next} !($1 in a) {print $0,second,"Not in Sep.csv";}' OFS="," Sep.csv Aug.csv  >>Output.csv  

The commands above produced the following Output.csv:
Name,Age,Place,Des,Name,Age,Place,Edu,Des,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv

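Here I am missing the following two matched rows (for Aug.csv) from the expected output. Please advise how to handle this; it seems that duplicate entries are being ignored.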

aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
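
A likely explanation for the missing rows, offered as a sketch rather than a definitive diagnosis: an awk array holds one value per key, so with `a[$1]=$0` each later row with the same `$1` overwrites the earlier one. The snippet below reproduces this with the two "aaa" rows from Aug.csv and contrasts it with a hypothetical variant keyed on the name plus an occurrence counter (the array names `a` and `n` are illustrative only).

```shell
rows='aaa,40,xxx,Aug
aaa,20,yyy,Aug'

# One slot per key: the second "aaa" row overwrites the first.
last=$(printf '%s\n' "$rows" |
    awk -F, '{ a[$1] = $0 } END { print a["aaa"] }')

# One slot per (key, occurrence): both rows survive, in input order.
all=$(printf '%s\n' "$rows" | awk -F, '
    { a[$1, ++n[$1]] = $0 }
    END { for (i = 1; i <= n["aaa"]; i++) print a["aaa", i] }')

printf 'one slot per key keeps: %s\n' "$last"
printf 'one slot per occurrence keeps:\n%s\n' "$all"
```

The second form uses only POSIX awk features (a comma in the subscript joins the parts with SUBSEP), so it does not depend on GNU awk.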

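I would also like to know whether the variables first and second (i.e. awk -v first="NOT,NOT,NOT,NOT" -v second="NOT,NOT,NOT,NOT,NOT") can be made dynamic, based on the number of fields/headers in Aug.csv and Sep.csv. The original files contain more fields, and the count varies each time (10 fields, 15 fields, and so on), so I do not want to type "NOT" 10 times by hand. Alternatively, is there some kind of repeat function that could print FS as many times as the original file has fields? My output would then look like this:

Expected Output.csv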

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
,,,,bbb,30,xxx,yyy,Sep,Not in Aug.csv
,,,,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,,,,,,Not in Sep.csv
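
One way to avoid typing the placeholder by hand, as a minimal sketch: count the fields of a header line with NF and join that many copies of a word with OFS. The helper name `rep` and the variable `word` are hypothetical, not built into awk; passing an empty `word` gives the blank-field variant shown above.

```shell
hdr='Name,Age,Place,Edu,Des'   # header line of Sep.csv from the question

# "NOT" once per header field, joined by the output separator.
not=$(printf '%s\n' "$hdr" | awk -F, -v OFS=, -v word=NOT '
    function rep(s, n,    r, i) {
        r = s
        for (i = 2; i <= n; i++) r = r OFS s
        return r
    }
    { print rep(word, NF) }')

# Same helper with an empty word: NF empty fields, i.e. NF-1 separators.
blank=$(printf '%s\n' "$hdr" | awk -F, -v OFS=, -v word= '
    function rep(s, n,    r, i) {
        r = s
        for (i = 2; i <= n; i++) r = r OFS s
        return r
    }
    { print rep(word, NF) }')

printf '%s\n' "$not"
printf '%s\n' "$blank"
```

Because the count comes from NF, the same code covers files with 10 or 15 columns without any change.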

Please advise; I look forward to your suggestions.

2 answers:

Answer 0: (score: 2)

A more involved GNU awk solution:

compare.awk script:

# Build "NOT" repeated n times, joined by FS.
function prNot(n,    r, s) {
    r=s="NOT"; while(--n) r=r FS s;
    return r
}
BEGIN{ FS=OFS="," }
# First file (Sep.csv): keep the header, and group the remaining
# fields of every row under its first field.
NR==FNR{
    if (NR==1) {
        sep_nf=NF; sep_fn=FILENAME; h=$0
    } else {
        sep[$1][++c]=$2;
        for(i=3;i<=NF;i++){ sep[$1][c]=sep[$1][c] FS $i }
    }
    next
}
# Second file (Aug.csv): print the combined header line.
FNR==1{
    aug_nf=NF; aug_fn=FILENAME; print $0,h,"Remarks"; next
}
# Aug.csv rows with and without a partner in Sep.csv.
$1 in sep{ matched[$1]; for(i in sep[$1]) print $0,$1,sep[$1][i],"Matched" }
!($1 in sep){ print $0,prNot(sep_nf),"Not in "sep_fn }
# Sep.csv rows whose key never appeared in Aug.csv.
END{
    for(i in sep)
        if (!(i in matched)) {
            for(j in sep[i]) print prNot(aug_nf),i,sep[i][j],"Not in "aug_fn
        }
}

Usage:

awk -f compare.awk Sep.csv Aug.csv

Answer 1: (score: 2)

Using GNU awk for true multi-dimensional arrays:

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
    for (i=1; i<=NF; i++) {
        nots[ARGIND] = (i>1 ? nots[ARGIND] OFS : "") "NOT"
    }
}
NR==FNR {
    file1[$1][++cnt[$1]] = $0
    next
}
{
    file2[$1]
    if ($1 in file1) {
        for (num in file1[$1]) {
            print file1[$1][num], $0, (FNR>1 ? "Matched" : "Remarks")
        }
    }
    else {
        print nots[1], $0, "Not in " ARGV[1]
    }
}
END {
    for (name in file1) {
        if ( !(name in file2) ) {
            for (num in file1[name]) {
                print file1[name][num], nots[2], "Not in " ARGV[2]
            }
        }
    }
}

$ awk -f tst.awk Aug.csv Sep.csv
Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv

If the output order matters, there are multiple ways to handle it.
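
One such way, sketched here under the assumption that the desired order is the one in the question's expected output (matched rows in Aug.csv order, then the Sep.csv-only rows, then the Aug.csv-only rows): store the rows under numeric counters and walk them with counting loops instead of `for (x in array)`, whose traversal order awk leaves unspecified. This is not taken from either answer. It sticks to POSIX awk, and all names in it (`rep`, `line`, `key`, `idx`, `cnt`, `seen`, `aug_only`) are hypothetical.

```shell
# Sample inputs copied from the question, written to a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/Aug.csv" <<'EOF'
Name,Age,Place,Des
aaa,40,xxx,Aug
aaa,20,yyy,Aug
ccc,35,xxx,Aug
EOF
cat > "$tmp/Sep.csv" <<'EOF'
Name,Age,Place,Edu,Des
aaa,50,zzz,eee,Sep
bbb,30,xxx,yyy,Sep
aaa,60,yyy,fff,Sep
bbb,50,yyy,fff,Sep
EOF

out=$(cd "$tmp" && awk '
# rep() is a hypothetical helper: s repeated n times, joined by OFS.
function rep(s, n,    r, i) {
    r = s
    for (i = 2; i <= n; i++) r = r OFS s
    return r
}
BEGIN { FS = OFS = "," }
# First file (Sep.csv): remember every row in input order.
NR == FNR {
    if (FNR == 1) { sep_hdr = $0; sep_nf = NF; sep_fn = FILENAME; next }
    line[++nsep] = $0
    key[nsep] = $1
    idx[$1, ++cnt[$1]] = nsep      # position of the i-th Sep row for this name
    next
}
# Second file (Aug.csv): combined header first, then data rows.
FNR == 1 { aug_nf = NF; aug_fn = FILENAME; print $0, sep_hdr, "Remarks"; next }
($1 in cnt) {
    seen[$1] = 1
    for (i = 1; i <= cnt[$1]; i++) print $0, line[idx[$1, i]], "Matched"
    next
}
{ aug_only[++naug] = $0 }          # defer Aug-only rows until the end
END {
    for (i = 1; i <= nsep; i++)
        if (!(key[i] in seen))
            print rep("NOT", aug_nf), line[i], "Not in " aug_fn
    for (i = 1; i <= naug; i++)
        print aug_only[i], rep("NOT", sep_nf), "Not in " sep_fn
}' Sep.csv Aug.csv)

printf '%s\n' "$out"
```

The design choice is to trade a little memory (every Sep.csv row plus the unmatched Aug.csv rows are buffered) for a deterministic order. With GNU awk, another option would be to keep the arrays of arrays and control the traversal order of the `for (x in array)` loops through `PROCINFO["sorted_in"]`.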