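I want to find the longest protein sequence translated from a CDS across the six forward and reverse reading frames.
Here is an example of the input format: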
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
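I want to find every substring that runs from "M" to "X", compute the length of each one, and pick the longest.
For example, for the input above, the script would find: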
>111 has two matches:
MGFSOX
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222 has one match:
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX
It would then compute the length of each match and print the longest match together with its length. This is the result I want:
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX 38
Here is my script, but it does not print the answer. Does anyone know how to fix it? Any suggestions would be helpful.
#!/usr/bin/perl -w
use strict;
use warnings;
my @pep=();
my $i=();
my @Xnum=();
my $n=();
my %hash=();
my @k=();
my $seq=();
$n=0;
open(IN, "<$ARGV[0]");
while(<IN>){
    chomp;
    if($_=~/^[^\>]/){
        @pep=split(//, $_);
        if($_ =~ /(X)/){
            push(@Xnum, $1);
            if($n >= 0 && $n <= $#Xnum){
                if(@pep eq "M"){
                    for($i=1; $i<=$#pep; $i++){
                        $seq=join("",@pep);
                        $hash{$i}=$seq;
                        push(@k, $i);
                    }
                }
                elsif(@pep eq "X"){
                    $n=$n+1;
                }
                foreach (sort {$a cmp $b} @k){
                    print "$hash{$k[0]}\t$k[0]";
                }
            }
        }
    }
    elsif($_=~/^\>/){
        print "$_\n";
    }
}
close IN;
Answer 0 (score: 2)
Check out this Perl one-liner:
$ cat iris.txt
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
$ perl -ne ' if(!/^>/) { print "$p"; while(/(M[^M]+?X)/g ) { if(length($1)>length($x)) {$x=$1 } } print "$x ". length($x)."\n";$x="" } else { $p=$_ } ' iris.txt
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPX 7
$
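Note that for >222 this prints 7 rather than the 38 requested in the question: [^M]+? cannot cross a second M, so the match restarts at the last M before the X. The following is a minimal sketch comparing that pattern with M[^X]*X, which allows internal M's. The helper longest_match is hypothetical and not part of the one-liner above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): return the longest
# substring of $seq matched by the pattern $re, or '' if nothing matches.
sub longest_match {
    my ($seq, $re) = @_;
    my $best = '';
    while ($seq =~ /($re)/g) {
        $best = $1 if length($1) > length($best);
    }
    return $best;
}

# The >222 record from the question.
my $seq = 'WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK';

# First pattern: no M allowed between the start M and the X.
# Second pattern: anything except X allowed between the start M and the X.
for my $re (qr/M[^M]+?X/, qr/M[^X]*X/) {
    my $hit = longest_match($seq, $re);
    print "$re\t$hit\t", length($hit), "\n";
}
```

Which pattern is right depends on whether an internal M should restart the candidate; the expected output in the question suggests it should not.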
Answer 1 (score: 1)
There is more than one way to do it!
Try this as well:
print and next if /^>/;
# chomp unconditionally so a final line without a newline is still processed
chomp;
my @z = /(M[^X]*X)/g;
my $m = "";
for my $s (@z) {
    $m = $s if length $s > length $m
}
say "$m\t" . length $m
Output:
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX 38
Use Perl >= 5.14, and make sure to run it with perl -n.
One-liner:
perl -E 'print and next if /^>/; chomp; my @z = /(M[^X]*X)/g; my $m = ""; for my $s (@z) { $m = $s if length $s > length $m } say "$m\t" . length $m' -n data.txt
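Answer 2 (score: 1)
Here is a solution using reduce from List::Util.
Edit: I mistakenly used maxstr, which happened to give the right result but was not what was needed. This post has been re-edited to use reduce instead.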
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/reduce/;
open my $fh, '<', \<<EOF;
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
EOF
my $id;
while (<$fh>) {
    chomp;
    if (/^>/) {
        $id = $_;
    }
    else {
        my $data = reduce {length($a) > length($b) ? $a : $b} /M[^X]*X/g;
        print "$id\n$data\t" . length($data) . "\n" if $data;
    }
}
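The "if $data" guard above matters because reduce returns undef when the list of matches is empty. A small sketch of that behaviour, assuming a hypothetical wrapper named longest_m_to_x that is not part of the script above:

```perl
use strict;
use warnings;
use List::Util qw/reduce/;

# Hypothetical helper: longest M...X stretch in $seq, or undef if there is none.
# The match in list context returns every M...X hit; reduce keeps the longer
# of each pair and returns undef for an empty list.
sub longest_m_to_x {
    my ($seq) = @_;
    return reduce { length($a) > length($b) ? $a : $b } $seq =~ /M[^X]*X/g;
}

for my $seq ('KKMGFSOXLMJJJJJJX', 'KKKKKK') {
    my $hit = longest_m_to_x($seq);
    print defined $hit ? "$hit\t" . length($hit) . "\n" : "no match\n";
}
```

Because the comparison uses > rather than >=, a tie between two equally long matches is resolved in favour of the later one.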
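Answer 3 (score: 0)
Here is my take on it.
I like to stuff FASTA files into a hash, with the FASTA names as keys. That way you can simply add further annotations to each record, such as base composition.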
#!/usr/local/ActivePerl-5.20/bin/env perl
use strict;
use warnings;
my %prot;
open (my $fh, '<', '/Users/me/Desktop/fun_prot.fa') or die $!;
my $string = do { local $/; <$fh> };   # slurp the whole file
close $fh;
chomp $string;
# Split into records on ">" and drop the empty leading chunk.
my @fasta = grep {/./} split (">", $string);
for my $aa (@fasta){
    my ($key, $value) = split ("\n", $aa);
    # Keep only the captured M...M stretch and drop the flanking residues.
    $value =~ s/[A-Z]*(M.*M)[A-Z]*/$1/;
    $prot{$key}->{'len'} = length($value);
    $prot{$key}->{'prot'} = $value;
}
# Sort by length, longest first, and print only the top record.
for my $sequence (sort { $prot{$b}->{'len'} <=> $prot{$a}->{'len'} } keys %prot){
    print ">" . $sequence, "\n", $prot{$sequence}->{'prot'}, "\t", $prot{$sequence}->{'len'}, "\n";
    last;
}
__DATA__
>1232
ASDFASMJJJJJMFASDFSDAFSDDFSA
>2343
AASFDFASMJJJJJJJJJJJJJJMRGQEGDAGDA
Output:
>2343
MJJJJJJJJJJJJJJM 16
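The script above reports only the single longest record and trims to an M...M stretch. If the per-record M...X result from the question is wanted instead, the same hash-of-records idea might be adapted roughly as follows. This is only a sketch: the helper longest_per_record and the sample data are hypothetical, and it assumes each sequence sits on a single line.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the FASTA file.
my $fasta = ">111\nKKMGFSOXLMJJJJJJX\n>222\nWWWMPPPMPPXKK\n";

# Hypothetical helper: parse single-line FASTA text into a hash keyed by
# record name, storing the longest M...X stretch and its length per record.
sub longest_per_record {
    my ($text) = @_;
    my %prot;
    for my $rec (grep { /\S/ } split /^>/m, $text) {
        my ($name, $seq) = split /\n/, $rec;
        next unless defined $seq;    # skip records without a sequence line
        my $best = '';
        while ($seq =~ /(M[^X]*X)/g) {
            $best = $1 if length($1) > length($best);
        }
        $prot{$name} = { prot => $best, len => length $best };
    }
    return \%prot;
}

my $prot = longest_per_record($fasta);
for my $name (sort keys %$prot) {
    print ">$name\n$prot->{$name}{prot}\t$prot->{$name}{len}\n";
}
```

Keeping the results in a hash leaves room to attach further per-record fields later, which is the point of the approach described above.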