Print the rows with the lowest values in multiple columns, based on duplicate strings in a third column

Time: 2017-09-13 17:02:18

Tags: bash awk

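I am trying to filter out rows that have a non-unique string in a particular column, keeping only the row with the lowest values in two other columns, and of course also keeping the rows that have no duplicates. See my example table: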

Col1  Col2  Col3  Col4  Col5  Col6  Col7
blah  blah  1     blah  blah  1     BBBB
blah  blah  0     blah  blah  3     AAAA
blah  blah  1     blah  blah  3     BBBB
blah  blah  2     blah  blah  0     AAAA
blah  blah  0     blah  blah  0     AAAA
blah  blah  8     blah  blah  3     CCCC

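Col1, Col2, Col4 and Col5 do not matter and only need to be copied over. If the string in Col7 occurs more than once, I want to print, out of all its occurrences, only the row with the lowest value in Col3 and then, if there is a tie, the lowest value in Col6. Finally, I want to add a new column stating "unique" or "multi" to indicate whether duplicates existed.

My desired output looks like this: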

Col1  Col2  Col3  Col4  Col5  Col6  Col7  Col8
blah  blah  0     blah  blah  0     AAAA  multi
blah  blah  1     blah  blah  1     BBBB  multi
blah  blah  8     blah  blah  3     CCCC  unique

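So far I have been trying with awk. I can find all the rows with a duplicated string and print them on a single line, but I don't know how to filter them before printing.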

 awk '{dup[$7]=dup[$7] ? dup[$7] " duplicate of " $1 : $1} END {for (x in dup) print dup[x], x}'

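Any help would be much appreciated. A solution in awk (with an explanation, please) would be preferred, as I am trying to understand it better.

Edited for clarity.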

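4 answers:

Answer 0: (score: 0)

Let's assume we have the following Input_file, in which I have included some ties in column 3 and some column 6 values that are lower than the column 3 values.

Input_file: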

cat Input_file
Col1  Col2  Col3  Col4  Col5  Col6  Col7
blah  blah  0     blah  blah  -3     AAAA
blah  blah  1     blah  blah  3     BBBB
blah  blah  2     blah  blah  -9     AAAA
blah  blah  0     blah  blah  0     AAAA
blah  blah  1     blah  blah  1     BBBB
blah  blah  8     blah  blah  3     CCCC

Now, here is the code for it.

Code:

awk '
FNR>1 && FNR==NR{
  a[$NF]=a[$NF]>$3?$3:a[$NF]?a[$NF]:$3;
  c[$NF]++;
  b[$NF,$3]++;
  e[$3]=$3
  d[$NF]=d[$NF]>$6?$6:d[$NF]?d[$NF]:$6;
  next
}
FNR==1 && FNR!=NR{
  print $0,"Col8";
  for(i in b){
      val=i
      gsub(/[0-9]+/,"",val)
    if(b[i]>1 && a[val]==e[val]){
      a[val]=a[val]<d[val]?a[val]:d[val]
    }}
  next
}
($NF in a) && ($6==a[$NF]||$3==a[$NF]){
  printf("%s %s\n",$0,c[$NF]>1?"Multi":"Unique");
  delete a[$NF]
}
' SUBSEP=""    Input_file  Input_file

Running the code:

./script.ksh
Col1  Col2  Col3  Col4  Col5  Col6  Col7 Col8
blah  blah  1     blah  blah  3     BBBB Multi
blah  blah  2     blah  blah  -9     AAAA Multi
blah  blah  8     blah  blah  3     CCCC Unique

Answer 1: (score: 0)

sort + uniq + sed trick:

echo "$(head -1 file)  Col8" && \
      sort -k7 -k3,3n -k6,6n <(tail -n +2 file) | uniq -cf6 \
      | sed -E 's/^ *1 (.*)/\1  unique/; s/^ *([2-9]|[0-9]{2,}) (.*)/\2  multi/'

Output:

Col1  Col2  Col3  Col4  Col5  Col6  Col7  Col8
blah  blah  0     blah  blah  0     AAAA  multi
blah  blah  1     blah  blah  1     BBBB  multi
blah  blah  8     blah  blah  3     CCCC  unique
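
The sort keys on Col3 and Col6 are worth comparing numerically (the `n` modifier, as in `-k3,3n`), since a plain text comparison ranks 10 before 2. A tiny sketch with made-up rows (the variable names are only for illustration):

```shell
# two made-up rows that share the same key in the last field
rows='x 10 A
x 2 A'
lexical=$(printf '%s\n' "$rows" | sort -k3 -k2,2  | head -n 1)   # field 2 compared as text
numeric=$(printf '%s\n' "$rows" | sort -k3 -k2,2n | head -n 1)   # field 2 compared as a number
echo "lexical first: $lexical"
echo "numeric first: $numeric"
```

With single-digit values, as in the sample table, both forms give the same order.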

----------

GNU datamash + awk

Bonus solution:

datamash -WHfs -g7 count 7 min 6 min 3 <file \
    | awk 'NR==1{ $8="Col8" }NR>1{ $3=$10; $6=$9; $8=($8>1)?"multi":"unique" }{$9=$10=""}1' \
    | column -t

Answer 2: (score: 0)

awk to the rescue! With a little help from other tools:

$ head -1 file && sed 1d file | 
       sort -k7 -k3,3n -k6,6n | 
       uniq -c -f6 | 
       awk '!a[$NF]++{c=$1; sub(/^ *[0-9]+ +/,""); print $0,c==1?"uniq":"multi"}'

Col1  Col2  Col3  Col4  Col5  Col6  Col7
blah  blah  0     blah  blah  0     AAAA multi
blah  blah  1     blah  blah  1     BBBB multi
blah  blah  8     blah  blah  3     CCCC uniq

Of course, if your file has no header, you can drop the first part.
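
The pipeline relies on `uniq -c -f6`, which skips the first six fields when comparing adjacent lines, keeps the first line of each run, and prefixes it with the size of the run. Because the input is already sorted by Col7, then Col3, then Col6, that first line is the one with the lowest values. A small sketch with made-up, pre-sorted rows (the variable `groups` is only for illustration):

```shell
# made-up rows, already sorted so that equal Col7 values are adjacent
groups=$(printf '%s\n' \
  'a b 0 c d 0 AAAA' \
  'a b 0 c d 3 AAAA' \
  'a b 8 c d 3 CCCC' | uniq -c -f6)
# each surviving line starts with the number of rows in its group
printf '%s\n' "$groups"
```

The width of the count column differs between uniq implementations, so it is safer to strip it with a pattern than by a fixed number of characters.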

Answer 3: (score: 0)

Using awk

One-liner

awk 'FNR==1{print $0,"Col8";next}function cp(){a[$7]=$0;b[$7]=$3;c[$7]=$6}{f=$7 in a;  d[$7]++}!f{o[++i]=$7; cp(); next}f && b[$7]>$3{cp();next}f && b[$7]==$3 && $6<c[$7]{cp();next}END{for(i=1; i in o; i++)print a[o[i]],(d[o[i]]>1?"multi":"unique")}' file

Explanation

awk '
      # if first line was read then print current record and extra field as header
      # go to next line
      FNR==1{
             print $0,"Col8";
             next
      }
      # function which will be used frequently to store data
      function cp(){
             a[$7]=$0;
             b[$7]=$3;
             c[$7]=$6;
       }
       # variable f holds boolean status whether array a has key of field7 value
       # variable f will be used frequently  
       {
             f=$7 in a;
             d[$7]++
       }

       # if key does not exist in array a then
       !f{
             # store order
             # copy data
             # go to next line
             o[++i]=$7;
             cp();
             next
       }

       # if f is true and 3rd field value of previously stored data
       # is greater than current record field3 data then
       # we got smaller value lets save it
       # and go to next line

       f && b[$7]>$3{
              cp();
              next
       }

       # if variable f is true and field3 value of previously stored data
       # is equal to current record field3 its tie, 
       # and check whether 6th field value is lesser than previously stored value
       # then we got smaller value from field6 of current row/record/line
       # copy data
       # go to next line

       f && b[$7]==$3 && $6<c[$7]{
              cp();
              next
       }

       # end block
       # loop through array o,
       # print value from array a, where index being o[i]
       # d[o[i]] holds count of occurrence of field7
       # if its greater than 1 then its multi otherwise unique

    END{
            for(i=1; i in o; i++)
                 print a[o[i]],(d[o[i]]>1?"multi":"unique")
       }
      ' file

Input

$ cat file
Col1  Col2  Col3  Col4  Col5  Col6  Col7
blah  blah  0     blah  blah  3     AAAA
blah  blah  1     blah  blah  3     BBBB
blah  blah  2     blah  blah  0     AAAA
blah  blah  0     blah  blah  0     AAAA
blah  blah  1     blah  blah  1     BBBB
blah  blah  8     blah  blah  3     CCCC

Execution

$ awk 'FNR==1{print $0,"Col8";next}function cp(){a[$7]=$0;b[$7]=$3;c[$7]=$6}{f=$7 in a;  d[$7]++}!f{o[++i]=$7;cp();next}f && b[$7]>$3{cp();next}f && b[$7]==$3 && $6<c[$7]{cp();next}END{for(i=1; i in o; i++)print a[o[i]],(d[o[i]]>1?"multi":"unique")}' file

Output

Col1  Col2  Col3  Col4  Col5  Col6  Col7 Col8
blah  blah  0     blah  blah  0     AAAA multi
blah  blah  1     blah  blah  1     BBBB multi
blah  blah  8     blah  blah  3     CCCC unique
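
For reuse, the same single-pass idea can be wrapped in a shell function. This is only a sketch: the function name `min_rows` and the array names are made up here, and the `+0` forces numeric comparison on the assumption that Col3 and Col6 always hold numbers.

```shell
# min_rows FILE: hypothetical helper that keeps one row per Col7 value
min_rows() {
  awk '
    NR == 1 { print $0, "Col8"; next }        # header: append the new column name
    {
      k = $7
      cnt[k]++                                # how many rows share this Col7 value
      if (!(k in row)) order[++n] = k         # remember first-seen order of keys
      # keep the row with the smallest Col3; break ties on the smallest Col6
      if (cnt[k] == 1 || $3+0 < c3[k] || ($3+0 == c3[k] && $6+0 < c6[k])) {
        row[k] = $0; c3[k] = $3+0; c6[k] = $6+0
      }
    }
    END {
      for (i = 1; i <= n; i++)
        print row[order[i]], (cnt[order[i]] > 1 ? "multi" : "unique")
    }
  ' "$1"
}
```

It would be called as `min_rows file`, optionally piped through `column -t` to realign the columns. Rows come out in the order in which each Col7 value first appears in the file.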