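I have a file in FASTA format, in which each record is a header such as ">12122" followed by a string. I want to remove duplicate strings from the file, keeping only one copy of each duplicated (identical) string together with its corresponding header.
In the example below, AGGTTCCGGATAAGTAAGAGCC is duplicated.
in: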
>17-46151
AGGTTCCGGATAAGTAAGAGCC
>1-242
AGGTTCCGGATAAGTAAGAGCC
>18-41148
TCTTAACCCGGACCAGAAACTA
>43-16054
GTCCCACTCCGTAGATCTGTTC
>32-24116
TAGCATATCGAGCCTGAGAACA
>42-16312
TGATACGGATGTTATACGCAGC
out:
>1-242
AGGTTCCGGATAAGTAAGAGCC
>18-41148
TCTTAACCCGGACCAGAAACTA
>43-16054
GTCCCACTCCGTAGATCTGTTC
>32-24116
TAGCATATCGAGCCTGAGAACA
>42-16312
TGATACGGATGTTATACGCAGC
Answer 0 (score: 0)
If the order must be preserved:
# Fields are delimited by newlines
awk -F "\n" '
BEGIN {
# Records are delimited by ">"
RS = ">"
}
# skip the empty first "record" that precedes the first ">"
NR > 1 {
# if the string is not yet known, append it to the order array O
if ( ! ( $2 in L ) ) O[++a] = $2
# remember the (last) header seen for this string
L[$2] = $1
}
# after reading the whole file
END{
# print each string with its last known header, in order of first appearance
for ( i=1; i<=a; i++ ) printf( ">%s\n%s\n", L[O[i]], O[i])
}
' YourFile
If the order does not matter:
awk -F "\n" 'BEGIN{RS=">"}NR>1{L[$2]=$1}END{for (l in L) printf( ">%s\n%s\n", L[l], l)}' YourFile
Answer 1 (score: 0)
$ awk '{if(NR%2) p=$0; else a[$0]=p}END{for(i in a)print a[i] ORS i}' file
>18-41148
TCTTAACCCGGACCAGAAACTA
>32-24116
TAGCATATCGAGCCTGAGAACA
>1-242
AGGTTCCGGATAAGTAAGAGCC
>43-16054
GTCCCACTCCGTAGATCTGTTC
>42-16312
TGATACGGATGTTATACGCAGC
Explanation:
{
if(NR%2) # odd lines (headers) are stored in p
p=$0
else # even lines (sequences) become the hash key
a[$0]=p
}
END{
for(i in a) # output every unique key and its header
print a[i] ORS i
}
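Note that for(i in a) visits the keys in an unspecified order, which is why the output above is not in input order. If input order matters, a minimal sketch of an order-preserving variant could look like the following. It assumes strict two-line records (header, then sequence); the function name dedup_keep_order, the array order and the counter n are introduced here purely for illustration.

```shell
# Hypothetical helper: keeps one record per sequence (with its last header),
# printed in the order in which each sequence first appeared.
dedup_keep_order() {
  awk '
  NR % 2 { p = $0; next }           # odd line: remember the header
  !($0 in a) { order[++n] = $0 }    # first sighting of this sequence: record its position
  { a[$0] = p }                     # keep the most recent header for this sequence
  END { for (i = 1; i <= n; i++) print a[order[i]] ORS order[i] }
  ' "$@"
}

# Small demo on made-up records; reads stdin when no file name is given
printf '>a\nAAA\n>b\nAAA\n>c\nCCC\n' | dedup_keep_order
```

With a real file, the call would be dedup_keep_order file.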
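Answer 2 (score: 0)
Here is a quick one-line awk solution. It should produce output sooner than the other answers, because it works line by line instead of queuing up the data (and looping over it) until the end: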
awk 'NR % 2 == 0 && !seen[$0]++ { print last; print } { last = $0 }' file
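Explanation:
NR % 2 == 0
runs only on even-numbered records (lines, NR).
!seen[$0]++
stores and increments a count, and is true only when the seen[] hash held no value yet for this line (!0 is 1, !1 is 0, !2 is 0, etc.).
{ print last; print }
prints last (the header), then the current line (the gene sequence).
{ last = $0 }
runs on every line and remembers it, so last holds the header by the time the sequence line is reached.
Note: although this preserves the original order, it shows the first instance of each unique sequence, whereas the expected output shows the final instance.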
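To make the first-instance behavior concrete, here is a small check that feeds only the two duplicate records from the question to the one-liner. The variable name out is just for illustration.

```shell
# Run the one-liner on the two records that share a sequence
out=$(printf '>17-46151\nAGGTTCCGGATAAGTAAGAGCC\n>1-242\nAGGTTCCGGATAAGTAAGAGCC\n' |
  awk 'NR % 2 == 0 && !seen[$0]++ { print last; print } { last = $0 }')
printf '%s\n' "$out"
# prints:
# >17-46151
# AGGTTCCGGATAAGTAAGAGCC
```

The header kept is >17-46151 (the first one seen), not >1-242 as in the expected output.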
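If you want the final instance of each unique sequence instead, you can reverse the file before passing it to awk and then reverse the result again.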
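A sketch of that reversal, assuming GNU tac is available (it is part of coreutils and is not installed by default on every system) and that the file consists of strict two-line records ending in a newline. After reversing, each sequence comes before its header, so the awk test is keyed on odd lines here. This is an adaptation for illustration, not necessarily the exact command originally posted, and the function name dedup_keep_last and the variable keep are hypothetical.

```shell
# Hypothetical helper: keeps the LAST record of each duplicated sequence.
dedup_keep_last() {
  tac "$@" |
    awk 'NR % 2 { keep = !seen[$0]++ }   # odd line (sequence): keep it only on first sighting
         keep                            # print the sequence and the header line that follows it
        ' |
    tac                                  # restore the original direction
}

# Small demo on made-up records; reads stdin when no file name is given
printf '>a\nAAA\n>b\nAAA\n>c\nCCC\n' | dedup_keep_last
```

With a real file, the call would be dedup_keep_last file. Records come out in the order of their last occurrence, which for the sample in the question matches the expected output.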