Merge two files with awk, keeping the row with the larger value in a given column

Time: 2018-08-16 23:14:00

Tags: bash shell awk grep

I have two tab-delimited files:

A 500 50
A 600 30
B 300 100
C 600 40

A 500 70
A 600 30
B 300 90

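I would like to merge these two files. For rows that match on columns 1 and 2, I want to keep the larger value in column 3.

The output would then be: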

A 500 70
A 600 30
B 300 100
C 600 40

These are samples of the actual values:

==> cut125_beng_jointvcf_varcal_geno6.txt <==
scaffold_3015                   5910            44.88210969
scaffold_3015                   5912            67.86783682
scaffold_3015                   5916            79.02675660
scaffold_3015                   5926            18.41190163
scaffold_3015                   5930            42.07625795
scaffold_3015                   5931            52.63549142
scaffold_3015                   5954            37.34609103
scaffold_3015                   5983            47.36974946
scaffold_3015                   5991            41.45881125

==> cut125_wbm_jointvcf_varcal_geno6.txt <==
scaffold_3015                   5910            50.79731830
scaffold_3015                   5916            146.20529658
scaffold_3015                   5926            184.50309487
scaffold_3015                   5930            160.27435340
scaffold_3015                   5931            172.71907060
scaffold_3015                   5954            161.39740159
scaffold_3015                   5968            146.54839149
scaffold_3015                   5983            97.01874773
scaffold_3015                   5991            73.54761456

1 answer:

Answer 0 (score: 1)

Could you please try the following.

awk '
FNR==NR{
   a[$1,$2]=$3
   next
}
($1,$2) in a{
   $3=(a[$1,$2]>$3?a[$1,$2]:$3)
   b[$1,$2]
}
1
END{
   for(i in a){
      if(!(i in b)){
        print i,a[i]
      }
   }
}' SUBSEP=" "  Input_file1  Input_file2

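This also takes care of entries that are not common to both Input_files: a line found only in Input_file2 is printed as it is, and a line found only in Input_file1 is printed from the END block.

Explanation: the same code with comments added.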
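As a quick check of that behaviour on the sample data from the question, here is a minimal sketch. It is a variation rather than the answer's exact command: it assumes the files really are tab-delimited and therefore sets FS, OFS and SUBSEP to a tab so the output stays tab-delimited, and it adds `+0` to make the numeric comparison explicit. The scratch file names (`file1.txt`, `file2.txt`) are hypothetical.

```shell
#!/bin/sh
# Hypothetical scratch files holding the sample data from the question.
tmp=$(mktemp -d)
printf 'A\t500\t50\nA\t600\t30\nB\t300\t100\nC\t600\t40\n' > "$tmp/file1.txt"
printf 'A\t500\t70\nA\t600\t30\nB\t300\t90\n' > "$tmp/file2.txt"

merged=$(awk '
BEGIN { FS = OFS = SUBSEP = "\t" }    # assumption: tabs for input, output and array keys
FNR == NR { a[$1,$2] = $3; next }     # first file: remember column 3 per (col1, col2) key
($1,$2) in a {                        # second file: key was also seen in the first file
   if (a[$1,$2] + 0 > $3 + 0)         # keep the larger of the two values
      $3 = a[$1,$2]
   b[$1,$2]                           # mark the key as already printed
}
1                                     # print every line of the second file
END {
   for (i in a)                       # keys found only in the first file
      if (!(i in b)) print i, a[i]
}' "$tmp/file1.txt" "$tmp/file2.txt")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

On this sample only one key (C 600) is unique to the first file, so the unspecified order of `for (i in a)` does not affect the result.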

awk '
FNR==NR{                        ##TRUE only while the first file (Input_file1) is being read.
   a[$1,$2]=$3                  ##Store $3 of the current line in array a under the key $1,$2.
   next                         ##Skip the remaining statements and move on to the next line.
}
($1,$2) in a{                   ##For Input_file2 lines: check whether the key $1,$2 was also seen in Input_file1.
   $3=(a[$1,$2]>$3?a[$1,$2]:$3)   ##Set $3 to a[$1,$2] if that value is greater than the current $3, otherwise keep $3.
   b[$1,$2]                     ##Record the key in array b to mark lines common to Input_file1 and Input_file2.
}
1                               ##Print the current Input_file2 line, whether or not $3 was changed.
END{                            ##END block, run after both files have been read.
   for(i in a){                 ##Loop over every key stored in array a (the order is unspecified).
      if(!(i in b)){            ##Key i is not in b, so the line exists only in Input_file1 and has not been printed yet.
        print i,a[i]            ##Print the key i and its stored value a[i].
      }
   }
}' SUBSEP=" " Input_file1  Input_file2      ##Set SUBSEP to a space, then name Input_file1 and Input_file2.
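The `a[$1,$2]>$3` test in the script is expected to compare as numbers because both sides come from input fields that look numeric. The sketch below, a small illustration rather than part of the answer, shows what would happen if the same digits were plain strings, and why appending `+0` is a cheap safeguard. The variable names `x`, `y`, `s` and `n` are made up for the example; the two values are taken from the question's sample data.

```shell
#!/bin/sh
result=$(awk 'BEGIN {
   x = "146.20529658"; y = "79.02675660"   # string constants, not input fields
   s = (x > y) ? "x" : "y"                 # string comparison, character by character
   n = (x+0 > y+0) ? "x" : "y"             # adding 0 forces a numeric comparison
   print "string comparison picks " s
   print "numeric comparison picks " n
}')
printf '%s\n' "$result"
```

As strings, "146.20529658" sorts before "79.02675660" because "1" comes before "7", so only the numeric form selects the larger value.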


EDIT: As per the OP, the output lines should keep the same order as in Input_file2 and Input_file1, so I am adding the following solution.

awk '
FNR==NR{                        ##TRUE only while the first file (Input_file1) is being read.
   a[$1,$2]=$3                  ##Store $3 of the current line in array a under the key $1,$2.
   if(!b[$1,$2]++){             ##True only the first time this $1,$2 key is seen in Input_file1.
     d[++count]=$1 OFS $2       ##Save the key as $1 OFS $2 in array d at the next position, which preserves Input_file1 order.
   }
   next                         ##Skip the remaining statements and move on to the next line.
}
($1,$2) in a{                   ##For Input_file2 lines: check whether the key $1,$2 was also seen in Input_file1.
   $3=(a[$1,$2]>$3?a[$1,$2]:$3) ##Set $3 to a[$1,$2] if that value is greater than the current $3, otherwise keep $3.
   c[$1,$2]                     ##Record the key in array c to mark lines common to Input_file1 and Input_file2.
}
1                               ##Print the current Input_file2 line, whether or not $3 was changed.
END{                            ##END block, run after both files have been read.
   for(i=1;i<=count;i++){       ##Loop over the saved keys in array d, in Input_file1 order.
      if(!(d[i] in c)){         ##Key d[i] is not in c, so the line exists only in Input_file1 and has not been printed yet.
        print d[i],a[d[i]]      ##Print the key d[i] and its stored value a[d[i]].
      }
   }
}' SUBSEP=" " Input_file1  Input_file2      ##Set SUBSEP to a space (the same as the default OFS), then name Input_file1 and Input_file2.
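To illustrate the ordering, here is a minimal sketch on made-up data in which several keys exist only in the first file. The file names `f1` and `f2` and their contents are hypothetical. Unlike the script above, it stores each key as `$1 SUBSEP $2` instead of `$1 OFS $2`, an assumption that keeps the `d[i] in c` lookup valid even if OFS and SUBSEP are set to different values, and it uses `seen` as the name of the first-occurrence array.

```shell
#!/bin/sh
# Hypothetical data: three keys appear only in f1, one key is shared.
tmp=$(mktemp -d)
printf 'Z 1 5\nA 1 9\nM 2 3\nA 2 1\n' > "$tmp/f1"
printf 'A 1 4\n' > "$tmp/f2"

ordered=$(awk '
FNR == NR {                           # first file
   a[$1,$2] = $3                      # remember column 3 per key
   if (!seen[$1,$2]++)                # first occurrence of this key
      d[++count] = $1 SUBSEP $2       # remember keys in the order they appear
   next
}
($1,$2) in a {                        # second file: key shared with the first file
   if (a[$1,$2] + 0 > $3 + 0)
      $3 = a[$1,$2]                   # keep the larger value
   c[$1,$2]                           # mark the key as already printed
}
1                                     # print every line of the second file
END {
   for (i = 1; i <= count; i++)       # walk the first-file keys in input order
      if (!(d[i] in c)) print d[i], a[d[i]]
}' SUBSEP=" " "$tmp/f1" "$tmp/f2")

printf '%s\n' "$ordered"
rm -rf "$tmp"
```

The shared key is printed first, with the larger value 9, followed by the first-file-only keys Z 1, M 2 and A 2 in the order they appear in `f1`.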