我有一个我想要修改的大型Fasta文件。它基本上由许多带有标题的序列组成,这些标题以">"开头。我的问题是,即使序列是唯一的,一些标题也不是唯一的。
示例:
> acrdi | AD19082 STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP
> acrdi | AD19082 MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
现在我想找到我的大型Fasta文件中的所有重复项并将数字附加到重复项,以便我知道它是哪个副本(1,2,3,...,x)。当找到新的副本(一个带有另一个标题)时,计数器应该从头开始。
输出应该是这样的:
> acrdi | AD19082 STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP
> acrdi | AD19082-1 MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
我更喜欢使用awk或sed的方法,以便我可以轻松修改代码以在目录中的所有文件上运行。
我必须承认,我刚刚开始学习编程和解析,但我希望这不是一个愚蠢的问题。
提前THX提供帮助。答案 0 :(得分:0)
awk脚本:
BEGIN {
OFS="\n";
ORS=RS=">";
}
{
name = $1;
$1 = "";
suffix = names[name] ? "-" names[name] : "";
print name suffix $0, "\n";
names[name]++;
}
以上使用“>”作为记录分隔符,并检查第一个字段(可以复制的标题名称)。对于它打印的每一行,它会在字段出现的每个额外时间后在标题名称后面添加一个后缀(即第一个副本为“-1”,第二个为“-2”......)