我对你所有的awk / sed / perl专家都有疑问。我遇到了一个具有以下格式的文件,例如:
>GALHOMG00000016026_1 GALHOMT00000016026_1 GALHOMP00000016026_1 JH556633.1:35740-45316 1
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED
>HUMHOMG00000262990_1 HUMHOMT00000262990_1 HUMHOMP00000262990_1 JH556633.1:35740-45316 1
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED
>TGUHOMG00000002432_1 TGUHOMT00000002432_1 TGUHOMP00000002432_1 JH556633.1:35740-45316 1
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED
我想将此文件修改为以下内容:
>JH556633.1:35740-45316
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED
我知道我可以修改我所谓的标题(我的意思是>后面的行),如下所示:
awk 'NF > 1{$0=">"$4}; {print $0}' file.fa > file2.fa
我的问题是,如何删除其他两个段落?文件中可能存在段落的字符序列(即不计算标题行)不相同的实例。在这种情况下,我想基于具有相同标识符的条目的数量附加扩展名(例如,在这种情况下JH556633.1-1:35740-45316
用于第二个的JH556633.1-2:35740-45316
或类似的东西)。关键是要使相同的标题(以>
开头的行)不同,但如果它们不相同则保留原始的字符序列。
如果有人有想法解决这个问题,我将非常感谢您的帮助。谢谢!
答案 0 :(得分:1)
这对你有用。它不依赖于不同序列之间的空行,因为并非所有的fasta文件都具有这些。它会为每个ID添加_N
,其中N
是找到ID的次数。仅与单个序列关联的ID将具有_1
。如果ID与多个不同的序列相关联,则将打印所有这些序列。
#!/usr/bin/env perl
use strict;
use warnings;
## The field of the ID line you want to keep.
## Since we start counting from 0, to get the 4th
## field, set this to 3.
my $want=3;
my (@fields,%seqs,%seen,$seq);
## Read the input file
while (<>) {
## Skip blank lines
next if /^\s*$/;
## remove trailing newlines
chomp;
## Is this an ID line?
if (/^\s*>(.*)/) {
## Save the previous sequence (if any). The %seqs
## hash has the sequence as a key and the desired
## ID as a value.
if ($fields[0]) {
$seqs{$seq}=$fields[$want];
## Clear the previous sequence and IDs
$seq="";
@fields=();
}
## Split the ID fields into @fields.
@fields=split(/\s+/);
}
## If this is a sequence, add to $seq
else {
$seq.=$_;
}
}
## Get the last sequence
$seqs{$seq}=$fields[$want];
foreach my $sequence (sort keys(%seqs)) {
## Add an identifier.
$seen{$seqs{$sequence}}++;
print ">$seqs{$sequence}_$seen{$seqs{$sequence}}\n";
## Convert the sequence back to FASTA
$sequence=~s/(.{60})/$1\n/g;
print "$sequence\n";
}
将脚本保存为foo.pl
或其他任何内容,使其可执行chmod 744 foo.pl
并运行为:
$ ./foo.pl file.fa
>JH556633.1:35740-45316_1
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED
答案 1 :(得分:0)
假设$4
根据您发布的输入不能包含&
或\<digit>
(如果它可以是一个微不足道的调整):
$ awk -v RS= '!seen[$4]++{sub(/[^\n]+/,$4);print}' file
JH556633.1:35740-45316
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED
看起来您还有另一个问题,所以发布一个新问题,其中包含一些代表性输入和该问题的预期输出。
答案 2 :(得分:0)
sed -n 's/^>\([^ ]\{1,\} \)\{3\}/>/;/^ *$/q;p' YourFile
基于您的示例(posix版本,因此--posix
在GNU sed上)