I have two tab-delimited files:
A 500 50
A 600 30
B 300 100
C 600 40
and
A 500 70
A 600 30
B 300 90
I want to merge these two files; for rows that match on columns 1 and 2, I want to keep the larger value in column 3.
The output would then be:
A 500 70
A 600 30
B 300 100
C 600 40
Here is a sample of the actual values:
==> cut125_beng_jointvcf_varcal_geno6.txt <==
scaffold_3015 5910 44.88210969
scaffold_3015 5912 67.86783682
scaffold_3015 5916 79.02675660
scaffold_3015 5926 18.41190163
scaffold_3015 5930 42.07625795
scaffold_3015 5931 52.63549142
scaffold_3015 5954 37.34609103
scaffold_3015 5983 47.36974946
scaffold_3015 5991 41.45881125
==> cut125_wbm_jointvcf_varcal_geno6.txt <==
scaffold_3015 5910 50.79731830
scaffold_3015 5916 146.20529658
scaffold_3015 5926 184.50309487
scaffold_3015 5930 160.27435340
scaffold_3015 5931 172.71907060
scaffold_3015 5954 161.39740159
scaffold_3015 5968 146.54839149
scaffold_3015 5983 97.01874773
scaffold_3015 5991 73.54761456
Answer 0 (score: 1)
Could you please try the following.
awk '
FNR==NR{
  a[$1,$2]=$3
  next
}
($1,$2) in a{
  $3=(a[$1,$2]>$3?a[$1,$2]:$3)
  b[$1,$2]
}
1
END{
  for(i in a){
    if(!(i in b)){
      print i,a[i]
    }
  }
}' OFS="\t" SUBSEP="\t" Input_file1 Input_file2
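This also handles keys that are not common to both files: a key that is present in Input_file1 but missing from Input_file2 (or vice versa) is still printed. Lines found only in Input_file1 are printed at the end, in no guaranteed order.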
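As a minimal sketch, the merge can be tried end to end on the sample rows from the question. The temporary file names and the `merge_max` function name are made up for this demo, and setting `OFS` and `SUBSEP` to a tab is an assumption made so that every output line stays tab-delimited.

```shell
#!/bin/sh
# Hypothetical demo files built from the sample rows in the question.
tmp=$(mktemp -d)
printf 'A\t500\t50\nA\t600\t30\nB\t300\t100\nC\t600\t40\n' > "$tmp/file1.txt"
printf 'A\t500\t70\nA\t600\t30\nB\t300\t90\n' > "$tmp/file2.txt"

# merge_max FILE1 FILE2: keep the larger column 3 for each (column 1, column 2) key.
merge_max() {
  awk '
  BEGIN { OFS = SUBSEP = "\t" }        # assumption: tab-delimited output
  FNR == NR { a[$1,$2] = $3; next }    # first file: remember column 3 per key
  ($1,$2) in a {                       # second file: key also seen in first file
    if (a[$1,$2] + 0 > $3 + 0) $3 = a[$1,$2]   # +0 forces a numeric comparison
    b[$1,$2]                           # mark the key as already printed
  }
  1                                    # print every line of the second file
  END {
    for (i in a) if (!(i in b)) print i, a[i]  # keys found only in the first file
  }' "$1" "$2"
}

merged=$(merge_max "$tmp/file1.txt" "$tmp/file2.txt")
printf '%s\n' "$merged"
rm -rf "$tmp"
```

For these sample rows the script prints the four lines given as the desired output in the question, separated by tabs.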
Explanation: adding a detailed explanation of the above code.
awk '
FNR==NR{                          ##FNR==NR is true only while the first file, Input_file1, is being read.
  a[$1,$2]=$3                     ##Store $3 in array a, keyed by $1 and $2 of the current line.
  next                            ##next skips all remaining statements for this line.
}
($1,$2) in a{                     ##For Input_file2: check whether the key $1,$2 of the current line is in array a.
  $3=(a[$1,$2]>$3?a[$1,$2]:$3)    ##Set $3 to a[$1,$2] if that is greater than $3, otherwise keep $3.
  b[$1,$2]                        ##Record the key in array b to track lines common to Input_file1 and Input_file2.
}
1                                 ##1 prints the current line (with $3 edited or not).
END{                              ##Start of the END block.
  for(i in a){                    ##Loop over every key in array a.
    if(!(i in b)){                ##If key i is not in array b, the line exists only in Input_file1 and has not been printed yet.
      print i,a[i]                ##Print key i and its value a[i].
    }
  }
}' OFS="\t" SUBSEP="\t" Input_file1 Input_file2   ##Set OFS and SUBSEP to a tab so the output stays tab-delimited, then name the two input files.
EDIT: Per the OP, the output lines should keep the same order as in Input_file2 and Input_file1, so I am adding the following solution.
awk '
FNR==NR{                          ##FNR==NR is true only while the first file, Input_file1, is being read.
  a[$1,$2]=$3                     ##Store $3 in array a, keyed by $1 and $2 of the current line.
  if(!b[$1,$2]++){                ##If this key has not been seen before in Input_file1, do the following.
    d[++count]=$1 OFS $2          ##Append the key to array d at the next index; OFS equals SUBSEP here, so it matches the keys of a and c.
  }
  next                            ##next skips all remaining statements for this line.
}
($1,$2) in a{                     ##For Input_file2: check whether the key $1,$2 of the current line is in array a.
  $3=a[$1,$2]>$3?a[$1,$2]:$3      ##Set $3 to a[$1,$2] if that is greater than $3, otherwise keep $3.
  c[$1,$2]                        ##Record the key in array c to track lines common to Input_file1 and Input_file2.
}
1                                 ##1 prints the current line (with $3 edited or not).
END{                              ##Start of the END block.
  for(i=1;i<=count;i++){          ##Loop over array d in the order the keys appeared in Input_file1.
    if(!(d[i] in c)){             ##If the key d[i] is not in array c, the line exists only in Input_file1 and has not been printed yet.
      print d[i],a[d[i]]          ##Print the key d[i] and its value a[d[i]].
    }
  }
}' OFS="\t" SUBSEP="\t" Input_file1 Input_file2   ##Set OFS and SUBSEP to a tab so the output stays tab-delimited, then name the two input files.
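The effect of tracking insertion order can be checked with a small sketch. The rows below are hypothetical (they are not from the question) and were chosen so that two keys, `Z 900` and `C 600`, exist only in the first file; the `merge_max_ordered` function name and the temporary files are likewise made up, and `OFS`/`SUBSEP` are set to a tab on the assumption that tab-delimited output is wanted.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical rows: "Z 900" and "C 600" exist only in the first file,
# "D 100" exists only in the second file.
printf 'A\t500\t50\nZ\t900\t5\nB\t300\t100\nC\t600\t40\n' > "$tmp/file1.txt"
printf 'A\t500\t70\nB\t300\t90\nD\t100\t1\n' > "$tmp/file2.txt"

# merge_max_ordered FILE1 FILE2: like the merge above, but rows found only
# in FILE1 are appended in the order they appeared in FILE1.
merge_max_ordered() {
  awk '
  BEGIN { OFS = SUBSEP = "\t" }              # assumption: tab-delimited output
  FNR == NR {
    key = $1 SUBSEP $2
    if (!(key in a)) order[++count] = key    # remember first-seen order of keys
    a[key] = $3
    next
  }
  ($1,$2) in a {
    if (a[$1,$2] + 0 > $3 + 0) $3 = a[$1,$2] # keep the larger value
    seen[$1,$2]                              # key was printed with the second file
  }
  1                                          # print every line of the second file
  END {
    for (i = 1; i <= count; i++)
      if (!(order[i] in seen)) print order[i], a[order[i]]
  }' "$1" "$2"
}

ordered=$(merge_max_ordered "$tmp/file1.txt" "$tmp/file2.txt")
printf '%s\n' "$ordered"
rm -rf "$tmp"
```

The second file's rows come out first in their original order (`A 500 70`, `B 300 100`, `D 100 1`), followed by `Z 900 5` and then `C 600 40`, which is the order those two rows have in the first file. A plain `for (i in a)` loop gives no such guarantee.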