Bash: finding non-common lines based on the second column

Date: 2017-09-24 10:55:36

Tags: bash duplicates

I have a pair of files that look like this:

File_1A.txt
SNP1 pos1
SNP2 pos2
SNP3 pos3
SNP4 pos4
SNP5 pos5
SNP7 pos7

File_1B.txt
SNP1 pos1
SNP2 pos2
SNP3 pos3
SNP5 pos5
SNP6 pos6
SNP7 pos7

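Some more notes about these two files:

  • They share most, but not all, of their SNP IDs. The actual name of an SNP can also differ between files: for example, SNP1 may be called SNP1a in one file and SNP1b in the other. This means I cannot compare the files on column 1; I need to use column 2.
  • The values in column 2 (they are numbers in my real files) are unique, i.e. there are no duplicates within each file.

Based on column 2, I want to find the lines that:

  • are present in File_1A.txt but not in File_1B.txt
  • are present in File_1B.txt but not in File_1A.txt

In this example, my output would be:

SNP4 pos4
SNP6 pos6

I have been looking at commands like diff, but they always report lines that differ from one another. How do I find the lines that are missing from one file or the other?

Many thanks.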

Edit: Apologies; to make things clearer, this is what my real files look like:

File_1A.txt

rs13339951:45007956:T:C 45007956
rs2838331 45026728
rs5647 12345

File_1B.txt

rs13339951 45007956
rs2838331 45026728
rs55778 1235597

From these files, I should get only these lines:

rs5647 12345
rs55778 1235597

2 answers:

Answer 0: (score: 1)

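If you are not concerned about the order of the output, i.e. it does not have to follow the order of the Input_file(s), then the following may help you.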

awk 'FNR==NR{a[$0]=$0;next} !($0 in a){print;next} {delete a[$0]} END{for(i in a){print i}}' File_1A.txt File_1B.txt

Adding a non-one-liner form of the solution too.

awk '
FNR==NR{
 a[$0]=$0;
 next
}
!($0 in a){
 print;
 next
}
{
 delete a[$0]
}
END{
 for(i in a){
   print i
}
}
' File_1A.txt File_1B.txt

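This prints every line that is present in File_1B.txt but not in File_1A.txt, and vice versa.

Code explanation: FNR==NR is a condition that is TRUE while the first Input_file is being read. Both FNR and NR hold the current record (line) number, but FNR is reset each time awk starts reading the next file, whereas NR keeps increasing until all Input_file(s) have been read.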
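To make the FNR/NR distinction concrete, here is a minimal sketch that prints both counters for every line of two throwaway files. The file names first.txt and second.txt and the variable fnr_demo are made up for this illustration and are not part of the answer.

```shell
# Create two small throwaway files (hypothetical names) in a temp directory.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first.txt"
printf 'c\nd\ne\n' > "$tmpdir/second.txt"

# For every line, print the file name, the overall record number (NR),
# the per-file record number (FNR), and whether FNR==NR currently holds.
fnr_demo=$(cd "$tmpdir" && awk '{
  print FILENAME, "NR=" NR, "FNR=" FNR, (FNR==NR ? "first" : "later")
}' first.txt second.txt)
printf '%s\n' "$fnr_demo"

rm -rf "$tmpdir"
```

On the second file FNR starts again at 1 while NR carries on from 3, so FNR==NR is no longer true. Note that the test relies on the first file being non-empty: with an empty first file, FNR==NR would also hold for the second file.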

awk '
FNR==NR{                 ##TRUE only while the first Input_file, File_1A.txt, is being read.
 a[$0]=$0;               ##Store the current line in array a, using the line itself as the index.
 next                    ##Skip the remaining statements and move on to the next line.
}
!($0 in a){              ##Reached only for File_1B.txt: TRUE if the current line is not an index of array a.
 print;                  ##Print the current line of File_1B.txt, since it is not present in File_1A.txt.
 next                    ##Skip the remaining statements and move on to the next line.
}
{
 delete a[$0]            ##Otherwise the line is common to both files, so delete its element from array a.
}
END{
 for(i in a){            ##Loop over all elements left in array a.
   print i               ##Print the index, i.e. each line present in File_1A.txt but NOT in File_1B.txt.
 }
}
' File_1A.txt File_1B.txt

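EDIT2: Since the OP changed the field separator, I have changed the code accordingly. I am not removing the previous code, since it may help people working with the earlier Input_file data.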

awk '
FNR==NR{                 ##TRUE only while the first Input_file, File_1A.txt, is being read.
 a[$1]=$0;               ##Store the current line in array a, indexed by its first field (the rsID before any ":" or space).
 next                    ##Skip the remaining statements and move on to the next line.
}
!($1 in a){              ##Reached only for File_1B.txt: TRUE if the first field is not an index of array a.
 print;                  ##Print the current line of File_1B.txt, since its first field is not present in File_1A.txt.
 next                    ##Skip the remaining statements and move on to the next line.
}
{
 delete a[$1]            ##Otherwise the first field is common to both files, so delete its element from array a.
}
END{
 for(i in a){            ##Loop over all elements left in array a.
   print a[i]            ##Print the stored line, i.e. each line present in File_1A.txt but NOT in File_1B.txt.
 }
}
' FS=':| ' File_1A.txt File_1B.txt
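The block above matches lines on the first field. Since the question asks for a comparison on column 2, the same two-pass structure can also be keyed on $2 with the default whitespace separator. The following is only a sketch under that assumption, not the answerer's code. It recreates the question's real-data sample in a temp directory, and the variable name unique_by_pos is made up for the illustration.

```shell
# Recreate the sample files from the question in a temp directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'rs13339951:45007956:T:C 45007956' 'rs2838331 45026728' 'rs5647 12345' > "$tmpdir/File_1A.txt"
printf '%s\n' 'rs13339951 45007956' 'rs2838331 45026728' 'rs55778 1235597' > "$tmpdir/File_1B.txt"

# Same logic as above, but indexed by the second column (the position).
unique_by_pos=$(awk '
FNR==NR { a[$2]=$0; next }        # first file: remember each line under its column-2 value
!($2 in a) { print; next }        # second file: position never seen in the first file
{ delete a[$2] }                  # position present in both files: drop it
END { for (i in a) print a[i] }   # whatever is left exists only in the first file
' "$tmpdir/File_1A.txt" "$tmpdir/File_1B.txt" | sort)
printf '%s\n' "$unique_by_pos"

rm -rf "$tmpdir"
```

The output is piped through sort only to make the order predictable, because awk does not define the order in which for (i in a) visits the elements.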

Answer 1: (score: 1)

If there are no duplicates within each file, you can do:

$ awk '$2 in a{delete a[$2];next}{a[$2]=$0}END{for(i in a) print a[i]}' filea fileb
SNP6 pos6
SNP4 pos4

Explanation:

$2 in a {           # if the 2nd column value is already hashed in a
    delete a[$2]    # delete it and skip to...
    next }          # the next record
{
    a[$2]=$0 }      # otherwise hash the record, with $2 as the key
END {               # after both files, only the unpaired records remain in a
    for(i in a)     # iterate over them and
        print a[i]  # output them
}
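As a quick check, the one-liner can be run against the first sample from the question. This is a sketch that rebuilds those two files in a temp directory. The variable name pairless is made up for the illustration, and the trailing sort is an addition that fixes the output order, which awk leaves unspecified for for (i in a).

```shell
# Recreate the first sample from the question in a temp directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'SNP1 pos1' 'SNP2 pos2' 'SNP3 pos3' 'SNP4 pos4' 'SNP5 pos5' 'SNP7 pos7' > "$tmpdir/File_1A.txt"
printf '%s\n' 'SNP1 pos1' 'SNP2 pos2' 'SNP3 pos3' 'SNP5 pos5' 'SNP6 pos6' 'SNP7 pos7' > "$tmpdir/File_1B.txt"

# A column-2 value seen a second time cancels its stored record;
# records that never find a partner are printed at the end.
pairless=$(awk '$2 in a { delete a[$2]; next } { a[$2]=$0 } END { for (i in a) print a[i] }' \
    "$tmpdir/File_1A.txt" "$tmpdir/File_1B.txt" | sort)
printf '%s\n' "$pairless"

rm -rf "$tmpdir"
```

Because each key is cancelled on its second appearance, this approach depends on column 2 being unique within each file, as the question states.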