I have a DNA sequence, for example ATCGATCG. I also have a database of DNA sequences in the following format:
>Name of sequence1
SEQUENCEONEEXAMPLEGATCGATC
>Name of sequence2
SEQUENCETWOEXAMPLEGATCGATC
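(So odd-numbered lines contain names and even-numbered lines contain sequences.) Currently, I search the database for a perfect match between my sequence and a database sequence as follows (assume all variables have been declared):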
my $name;
my $seq;
my $returnval = "The sequence does not match any in database";
open (my $database, "<", $db1) or die "Can't find db1";
until (eof $database) {
    chomp ($name = <$database>);
    chomp ($seq = <$database>);
    if (
        index($seq, $entry) != -1
        || index($entry, $seq) != -1
    ) {
        $returnval = "The sequence matches: ". $name;
        last;
    }
}
close $database;
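Is there a way to return the name of the sequence with the highest match rate, together with the percentage match between my entry and that database sequence?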
Answer 0 (score: 3)
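String::Similarity
Its similarity function returns the similarity between two strings as a value between 0 and 1, where 0 means completely different and 1 means identical.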
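use String::Similarity;

# $db1 (the path to the database file) is assumed to be defined elsewhere.
my $entry = "AGGUUG";
my $returnval = "The database contains no sequences";
my $highestsim = 0;
my $highestname;

open (my $database, "<", $db1) or die "Can't open $db1: $!";
until (eof $database) {
    chomp (my $name = <$database>);
    chomp (my $seq = <$database>);
    # The third argument is a lower limit: similarity may stop early once the
    # score drops below it, and then returns some value lower than the limit.
    my $currsim = similarity $entry, $seq, $highestsim;
    if ($currsim > $highestsim) {
        $highestsim = $currsim;
        $highestname = $name;
    }
}
close $database;

if (defined $highestname) {
    $highestname =~ s/^>//;    # drop the leading ">" of the header line
    $returnval = sprintf "This sequence matches %s the best with %.1f%% similarity",
        $highestname, $highestsim * 100;
}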
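If installing String::Similarity is not an option, the same "best name plus percentage" idea can be sketched in core Perl. The following is only a rough stand-in, not what the module computes internally: it scores each pair by Levenshtein edit distance relative to the longer string. The helper names edit_distance, percent_match and best_match are hypothetical, and the database is an in-memory string purely for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: Levenshtein edit distance between two strings.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @curr = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @curr, min($prev[$j] + 1, $curr[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Hypothetical helper: match percentage relative to the longer string.
sub percent_match {
    my ($s, $t) = @_;
    my $longest = max(length($s), length($t));
    return 100 if $longest == 0;
    return 100 * (1 - edit_distance($s, $t) / $longest);
}

# Hypothetical helper: read name/sequence line pairs and keep the best hit.
sub best_match {
    my ($entry, $fh) = @_;
    my ($best_name, $best_pct) = (undef, -1);
    while (defined(my $name = <$fh>)) {
        my $seq = <$fh>;
        last unless defined $seq;    # ignore a trailing name without a sequence
        chomp($name, $seq);
        $name =~ s/^>//;             # drop the leading ">" of the header line
        my $pct = percent_match($entry, $seq);
        ($best_name, $best_pct) = ($name, $pct) if $pct > $best_pct;
    }
    return ($best_name, $best_pct);
}

# Example data held in memory instead of a file (made up for this sketch).
my $db = ">seq1\nATCGATCG\n>seq2\nATCGTTCG\n>seq3\nGGGGCCCC\n";
open my $fh, '<', \$db or die "Cannot open in-memory database: $!";
my ($name, $pct) = best_match('ATCGTTCC', $fh);
close $fh;
printf "Best match: %s (%.1f%%)\n", $name, $pct;
```

Edit distance costs time proportional to the product of the two string lengths, so for long sequences or a large database a dedicated alignment tool is likely the better fit.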