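I am new to Perl and BioPerl, and I am trying to write a script that identifies instances of identical sequences. To do this I have in mind a script that takes two input files: the first is a multiple alignment in FASTA format, and the second is an accessory file that associates the FASTA IDs with other relevant information. My approach is to read the multiple alignment with Bio::SeqIO and put the file contents into a hash in which the sequence is the key and the value is either the ID or, where a sequence is shared, an array of IDs.
I think it should look something like this: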
"AATTTGTTGTTGTACC" => ('Seq1', 'Seq13'),
"TTTCTCTTTCCCAAAG" => 'Seq2',
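At the moment I am stuck, I think because of an error when trying to push the second ID into the array when a sequence is shared (i.e. "Seq13" in the example above).
Here is the test multiple alignment I am using: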
>Seq1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>Seq2
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>Seq13
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Below is the code I have written so far:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Data::Dumper;
my $seqs = shift @ARGV or die "please provide a multiple alignment file and an accesory information file: $!\n";
my $info = shift @ARGV or die "please provide a multiple alignment file and an accesory information file: $!\n";
#open(INFO, '<', $info);
my $inseq = Bio::SeqIO->new(
    -file   => $seqs,
    -format => "fasta",
);
my %hts;
while (my $seq = $inseq->next_seq) {
    # print $seq->seq(), "\t", $seq->id, "\n";
    if (defined $hts{$seq->seq()}) {
        print "Sequence already in hash:\t$seq->id\n";
        push @{$hts{$seq->seq()}}, ${$seq->id};
    }
    else {
        $hts{$seq->seq()} = $seq->id;
    }
    print Dumper \%hts
}
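So I would appreciate some help with the following:
1) I am getting an error that I do not fully understand, but I believe it comes from the push statement: Can't use string ("Seq1") as an ARRAY ref while "strict refs" in use at ht_sharing.pl line 24, line 3.
2) When the print statement outside the if block is active, it prints what I think it should, the ID (i.e. Seq1). In the print statement inside the if block, however, the same call $seq->id produces a reference (i.e. Bio::Seq=HASH(0x19e7210)->id). Why is this? I do not understand why printing $seq->id gives different output within the same while loop.
If anyone could provide clarification I would be very grateful. As I am still new to this, comments on best practice or on better ways to tackle the problem are of course also welcome.
Cheers, Ana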
Answer 0: (score: 1)
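Your code is very close but has a few small issues. The first is that you want to use the syntax if (exists $hash{$key}) { ... } to check whether a key exists; defined only tells you whether the value stored under that key is defined. The second is that you are dereferencing the value returned by $seq->id for no reason.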
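The two points above can be sketched without BioPerl. The snippet below is a minimal illustration, not the asker's script: the @records list is a hypothetical stand-in for what Bio::SeqIO would yield, using the short sequences from the question's example. It also shows one likely reason for the "ARRAY ref" error: the else branch in the question stores a plain string, so a later push has no array to push onto. Storing an array reference for every sequence, including the first occurrence, avoids that.

```perl
use strict;
use warnings;

# exists asks "is the key there?", defined asks "is its value defined?".
my %demo = (present_but_undef => undef);
my $key_exists    = exists  $demo{present_but_undef} ? 1 : 0;   # key is there
my $value_defined = defined $demo{present_but_undef} ? 1 : 0;   # value is undef

# Hypothetical stand-in for the records Bio::SeqIO would yield: [id, sequence].
my @records = (
    [ 'Seq1',  'AATTTGTTGTTGTACC' ],
    [ 'Seq2',  'TTTCTCTTTCCCAAAG' ],
    [ 'Seq13', 'AATTTGTTGTTGTACC' ],
);

my %ids_by_seq;
for my $rec (@records) {
    my ($id, $sequence) = @$rec;
    if (exists $ids_by_seq{$sequence}) {
        print "Sequence already in hash:\t$id\n";
    }
    # Every value is an array reference (created on first use),
    # so push always has an array to work on.
    push @{ $ids_by_seq{$sequence} }, $id;
}

for my $sequence (sort keys %ids_by_seq) {
    print "$sequence => @{ $ids_by_seq{$sequence} }\n";
}
```

With real data, $id and $sequence would come from $seq->id and $seq->seq inside the while loop; a sequence with a single ID simply ends up with a one-element array.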
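When you call the 'next_seq' method on a Bio::SeqIO object, it returns a Bio::Seq object. If you call the 'id' method on a Bio::Seq object, it returns the ID as expected, so no dereferencing is needed. Also, there is no need to import Bio::Seq explicitly (this is just a comment, not a problem).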
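As for the second question, the different output is most likely down to string interpolation: inside a double-quoted string Perl interpolates the variable $seq but does not call methods, so "->id" is printed literally after the stringified object. The sketch below uses a hypothetical Demo::Seq class in place of Bio::Seq (only an id method is modelled) to show the effect and two common ways around it.

```perl
use strict;
use warnings;

# Hypothetical minimal class standing in for Bio::Seq; only id() is modelled.
package Demo::Seq {
    sub new { my ($class, %args) = @_; return bless {%args}, $class }
    sub id  { return $_[0]{id} }
}

my $seq = Demo::Seq->new(id => 'Seq1');

# Method calls are not interpolated: this yields the stringified object
# followed by the literal text "->id".
my $literal = "ID: $seq->id";

# Option 1: call the method outside the string and concatenate.
my $concat = "ID: " . $seq->id;

# Option 2: interpolate an anonymous array holding the method's result.
my $inline = "ID: @{[ $seq->id ]}";

print "$literal\n$concat\n$inline\n";
```

The same reasoning applies to ${$seq->id} in the push statement: id returns a plain string, and wrapping it in ${ ... } asks Perl to treat that string as a scalar reference, which strict refs forbids. Passing $seq->id directly is enough.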
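Other comments:
1) Move the print Dumper \%hts; call to after the while (my $seq ...) loop (i.e., after you have gone through all the seq objects). Dumping the hash while you are still going through the file is not very useful here.
2) If you only need to know whether there are duplicates, you could count each sequence with $hts{$seq->seq}++ and look at the sorted values. That would be faster.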
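The counting approach can be sketched as follows. This is a minimal illustration under stated assumptions: the @sequences list is hypothetical sample data standing in for the values of $seq->seq read from the alignment, and the results are reported once, after the loop has finished.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for $seq->seq from each record.
my @sequences = qw(AATTTGTTGTTGTACC TTTCTCTTTCCCAAAG AATTTGTTGTTGTACC);

# Count how many times each sequence occurs.
my %count;
$count{$_}++ for @sequences;

# Report once, after all records have been read, most frequent first.
for my $sequence (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$count{$sequence}\t$sequence\n";
}

# Any sequence seen more than once is a duplicate.
my @duplicated = grep { $count{$_} > 1 } sort keys %count;
print "Duplicated sequences: ", scalar(@duplicated), "\n";
```

Note that counting alone discards the IDs, so it tells you which sequences are shared but not by whom. If the IDs are needed for the join with the accessory file, the hash of arrays is the better fit.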