I am trying to get the shortest and the longest sequence from a file that contains multiple GenBank-like entries. Example of the file:
LOCUS NM_182854 2912 bp mRNA linear PRI 20-APR-2016
DEFINITION Homo sapiens mRNA.
ACCESSION NM_182854
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
ORIGIN
1 gggcgatcag aagcaggtca cacagcctgt ttcctgtttt caaacgggga acttagaaag
61 tggcagcccc tcggcttgtc gccggagctg agaaccaaga gctcgaaggg gccatatgac
//
LOCUS NM_001323410 6992 bp mRNA linear PRI 20-APR-2016
DEFINITION Homo sapiens mRNA.
ACCESSION NM_001323410
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
ORIGIN
1 actacttccg gcttccccgc cccgccccgt ccccgggcgt ctccattttg gtctcaggtg
61 tggactcggc aagaaccagc gcaagaggga agcagagtta tagctacccc ggc
//
I want to print the accession number and the organism for the shortest sequence and for the longest sequence.
My code so far:
#!/usr/bin/perl
use strict;
use warnings;
print "enter file path\n";
while (my $line = <>){
chomp $line;
my @record = ($line);
foreach my $file(@record){
open(IN, "$file") or die "\n error opening file \n;/\n";
$/="//";
while (my $line = <IN>){
my @gb_seq = split ("ORIGIN", $line);
my $definition = $gb_seq[0];
my $sequence = $gb_seq[1];
$definition =~ m/ORGANISM[\s\t]+(.+)[\n\s\t]+/;
my $organism = $1;
if ($definition =~ m/ACCESSION[\s\t]+(\D\D_\d\d\d\d\d\d(\d*))[\n\s\t]+/){
my $accession = $1;
$sequence =~ s/\d//g;
$sequence =~ s/[\n\s\t]//g;
my $size = length($sequence);
my @sorted_keys = sort { $a <=> $b } keys my %size;
my $shortest = $sorted_keys[0];
my $longest = $sorted_keys[-1];
print "this is the shortest: $accession $organism size: $shortest\n";
print "this is the longest: $accession $organism size: $longest\n";
}
}}}
exit;
I thought about putting the lengths in a hash to get the shortest and the longest, but something is wrong there. I get these errors:
Use of uninitialized value $organism in concatenation (.) or string at test.pl line 39, <IN> chunk 1
Use of uninitialized value $shortest in concatenation (.) or string at test.pl line 39, <IN> chunk 1.
Use of uninitialized value $longest in concatenation (.) or string at test.pl line 40, <IN> chunk 1.
Which part should I change? Thanks.
Answer 0 (score: 2)
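We need to find the entries of extreme length while still being able to identify the records they belong to. Reading records delimited by // is again a good idea. However, each record is then a single string, and pulling the sequence directly out of it is harder than splitting it into lines first. So we may as well go line by line, since everything we need has clear markers.
The choice of data structure is important and depends on the purpose. Here I organize the data so that it is easy to work with, using a hash with elements
%block = ( 'accession' => { 'type' => type, 'sequence' => sequence }, ... )
Once the data has been read in, the search itself would be helped greatly by organizing the hash by sequence (and not by 'accession'), but that would make it hard to work with. I presume this may end up being used for more than one query, so a slight loss of speed does not matter. If the sole goal is to answer this specific question with the best performance, other approaches would be more suitable. Comments follow the code.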
use warnings;
use strict;
use feature qw(say);
my $file = 'data_seqs.txt';
open my $fh, '<', $file or die "Can't open $file -- $!";
# Hash, helper variables, flag (inside a sequence?), sequence-end marker
my (%block, $accession, $sequence);
my $is_seq = 0;
my $end_marker = qr(\s*//); # marks end of sequence: //
while (my $line = <$fh>)
{
chomp($line);
next if $line =~ /^\s*$/; # skip empty lines
if ($line =~ /$end_marker/) { # done with the sequence
$is_seq = 0;
$sequence = '';
next;
}
if ($line =~ /^\s*ACCESSION\s*(\w+)/) {
$accession = $1;
}
elsif ($line =~ /^\s*ORGANISM\s*(.+)/) {
$block{$accession}{'type'} = $1;
}
elsif ($line =~ /^\s*ORIGIN/) { # start sequence on next line
$is_seq = 1;
}
elsif ($is_seq) { # read (and add to) sequence
if ($line =~ /^\s*\d+\s*(.*)/) {
$block{$accession}{'sequence'} .= $1;
}
else { warn "Not sequence? Line: $line " }
}
}
# Identify keys for max and min length. Initialize both with the same (any) key,
# so that a file with a single record also works
my ($max) = keys %block;
my $min = $max;
foreach my $acc (keys %block)
{
my $current_len = length($block{$acc}{'sequence'});
if ( $current_len > length($block{$max}{'sequence'}) ) {
$max = $acc;
}
if ( $current_len < length($block{$min}{'sequence'}) ) {
$min = $acc;
}
}
say "Maximum length sequence: ACCESSION: $max, ORGANISM: " . $block{$max}{'type'};
say "Minimum length sequence: ACCESSION: $min, ORGANISM: " . $block{$min}{'type'};
use Data::Dumper;
print Dumper(\%block);
This prints (the Dumper printout is omitted)
Maximum length sequence: ACCESSION: NM_182854, ORGANISM: Homo sapiens
Minimum length sequence: ACCESSION: NM_001323410, ORGANISM: Homo sapiens
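A comment on search efficiency
A common approach is to first build a reverse-lookup hash, then use a library (for example List::Util) to find the maximum and the minimum, and then look up which records they belong to. For that we need to build the lookup hash and call the library twice, whereas the manual search above goes through the structure once and is also simpler. Another option is to use the sequences as the top-level hash keys and then find max and min directly. However, such a hash would be harder to work with.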
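For comparison, the reverse-lookup approach described above could look roughly like the sketch below. The contents of %block are hypothetical, shortened sample data, and %acc_by_len is a name made up for this illustration. Note that records of equal length overwrite each other in the lookup hash.

```perl
use strict;
use warnings;
use feature qw(say);
use List::Util qw(max min);

# Hypothetical sample in the same shape as %block (sequences shortened)
my %block = (
    NM_182854    => { type => 'Homo sapiens', sequence => 'gggcgatcagaagcaggtca' },
    NM_001323410 => { type => 'Homo sapiens', sequence => 'actacttccggc' },
);

# Reverse lookup: sequence length => accession.
# Records with equal lengths overwrite one another here.
my %acc_by_len = map { ( length( $block{$_}{sequence} ) => $_ ) } keys %block;

# Two library calls, then two lookups
my $longest  = $acc_by_len{ max( keys %acc_by_len ) };
my $shortest = $acc_by_len{ min( keys %acc_by_len ) };

say "Longest:  $longest ($block{$longest}{type})";
say "Shortest: $shortest ($block{$shortest}{type})";
```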
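Yet another approach is to organize the data in a structure that allows this particular information to be retrieved more efficiently, perhaps one based on arrays.
However, the gain in efficiency does not seem to justify the considerable loss of convenience. If speed proves to be a problem, this should be reconsidered.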
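As a rough idea of what an array-based layout might look like, here is a minimal sketch. It assumes each record is stored as [ accession, organism, sequence length ]; that layout and the sample values are assumptions for illustration, not something taken from the code above.

```perl
use strict;
use warnings;
use feature qw(say);

# Hypothetical records: [ accession, organism, sequence length ]
my @records = (
    [ 'NM_182854',    'Homo sapiens', 120 ],
    [ 'NM_001323410', 'Homo sapiens', 113 ],
);

# One pass, keeping references to the shortest and longest record seen so far
my ( $min, $max ) = ( $records[0], $records[0] );
for my $rec (@records) {
    $min = $rec if $rec->[2] < $min->[2];
    $max = $rec if $rec->[2] > $max->[2];
}

say "Longest:  $max->[0] ($max->[1]), length $max->[2]";
say "Shortest: $min->[0] ($min->[1]), length $min->[2]";
```

Storing the length up front avoids calling length repeatedly, at the cost of a structure that is less self-describing than the hash.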
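If you need to process multiple files, just change the loop to read from <>, as in while (my $line = <>), and give the file names on the command line. All lines from all of those files are then read one by one, and the rest of the code stays the same.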
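A small sketch of how <> walks through several files, one line at a time. The helper count_records and the temporary files are made up for this illustration; in normal use you would simply run the script as perl script.pl file1 file2 and let <> read @ARGV directly.

```perl
use strict;
use warnings;
use feature qw(say);
use File::Temp qw(tempfile);

# Count ACCESSION lines per file, reading all files through <>
sub count_records {
    local @ARGV = @_;    # <> reads the files listed in @ARGV, one after another
    my %count;
    while ( my $line = <> ) {
        # $ARGV holds the name of the file currently being read
        $count{$ARGV}++ if $line =~ /^\s*ACCESSION\s+\w+/;
    }
    return \%count;
}

# Two small temporary files stand in for real data files
my @files;
for my $text ( "ACCESSION NM_1\n//\n", "ACCESSION NM_2\n//\nACCESSION NM_3\n//\n" ) {
    my ( $tmp_fh, $name ) = tempfile( UNLINK => 1 );
    print {$tmp_fh} $text;
    close $tmp_fh or die "Can't close $name: $!";
    push @files, $name;
}

my $count = count_records(@files);
say "$_: $count->{$_} record(s)" for @files;
```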
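I may have misunderstood some of the terms. I do not remove the spaces from the "sequence", and I take the "type" only from the first line of the ORGANISM entry, to name a couple of candidates. These are easy to adjust; just let me know.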
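Answer 1 (score: 1)
You state that you need two pieces of data, the accession and the organism, for the longest and the shortest sequence. That means the values in your hash need to store two elements. Apart from that, when you use '//' as the record separator, the '//' still appears at the end of each record. So after you filter the whitespace and digits out of the sequence, you are still left with '//' at the end. When I ran the code through the debugger, I found that the lengths were all 2 for this reason.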
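The effect of the record separator can be seen in a minimal sketch like the following. The input string is hypothetical and is read through an in-memory filehandle so that no file is needed; chomp is used here to strip the trailing '//', which is one possible way of dealing with it.

```perl
use strict;
use warnings;
use feature qw(say);

# Hypothetical two-record input held in a string
my $data = "ORIGIN\n 1 gggcga tcagaa\n//\nORIGIN\n 1 actact\n//\n";
open my $fh, '<', \$data or die "Can't open in-memory handle: $!";

my @lengths;
{
    local $/ = "//";    # read one record per iteration
    while ( my $chunk = <$fh> ) {
        chomp $chunk;               # chomp removes the trailing $/, here "//"
        next if $chunk !~ /\S/;     # skip the whitespace left after the last "//"
        my ( undef, $sequence ) = split /ORIGIN/, $chunk;
        $sequence =~ s/[\d\s]//g;   # drop position numbers and whitespace
        push @lengths, length $sequence;
    }
}
say "@lengths";    # prints "12 6"
```

Without the chomp, each sequence would still end in '//' and every length would be off by two.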
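A few other things:
- I recommend the /x modifier on regexes, so that whitespace can be included for readability.
- You assume a successful match when matching against $definition; it is better to test the regex, assign on a match, and die on a mismatch.
- I renamed $line to $chunk, as it contains multiple lines.
Given the above modifications, I get: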
use v5.14;
use warnings;
print "Enter file path: ";
chomp(my $filename = <>);
open(IN, '<', $filename) or die "Error opening file $filename: $!\n";
$/ = "//" ;
my %organisms ;
while (my $chunk = <IN>) {
next if $chunk =~ /^\s*\n\s*$/ ;
my ($definition , $sequence) = split "ORIGIN", $chunk ;
my $organism ;
$definition =~ m/ ORGANISM [\s\t]+ (.+) [\n\s\t]+ /x
? $organism = $1
: die "Couldnt find ORGANISM line" ;
my $accession ;
$definition =~ m/ ACCESSION [\s\t]+ (\D\D _ \d{6} (\d*)) [\n\s\t]+ /x
? $accession = $1
: die "Cant find ACCESSION line" ;
$sequence =~ s/[\d\n\s\t\/]//g;
$organisms{ $sequence } = [ $accession , $organism ] ;
}
my @sorted_keys = sort { length $a <=> length $b } keys %organisms ;
my $shortest = $sorted_keys[0];
my $longest = $sorted_keys[-1];
say "this is the shortest: ", $organisms{$shortest}->[0],
", ", $organisms{$shortest}->[1],
" size: ", length $shortest, "\n",
" sequence: ", $shortest ;
say "this is the longest: ", $organisms{$longest}->[0],
", ", $organisms{$longest}->[1],
" size: ", length $longest, "\n",
" sequence: ", $longest ;
exit;
When run on the data, this produces:
$ ./sequence.pl
Enter file path: data.txt
this is the shortest: NM_001323410, Homo sapiens size: 113
sequence: actacttccggcttccccgccccgccccgtccccgggcgtctccattttggtctcaggtgtggactcggcaagaaccagcgcaagagggaagcagagttatagctaccccggc
this is the longest: NM_182854, Homo sapiens size: 120
sequence: gggcgatcagaagcaggtcacacagcctgtttcctgttttcaaacggggaacttagaaagtggcagcccctcggcttgtcgccggagctgagaaccaagagctcgaaggggccatatgac
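UPDATE: The problem with the code above is that if the same sequence appears in two chunks, the data is overwritten in the hash and lost. Below is an updated version that stores the data in an array of arrays, which solves this problem. It reports the same results: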
use v5.14;
use warnings;
print "Enter file path: ";
chomp(my $filename = <>);
open(IN, '<', $filename) or die "Error opening file $filename: $!\n";
$/ = "//" ;
my @organisms ;
while (my $chunk = <IN>) {
next if $chunk =~ /^\s*\n\s*$/ ;
my ($definition , $sequence) = split "ORIGIN", $chunk ;
my $organism ;
$definition =~ m/ ORGANISM [\s\t]+ (.+) [\n\s\t]+ /x
? $organism = $1
: die "Couldnt find ORGANISM line" ;
my $accession ;
$definition =~ m/ ACCESSION [\s\t]+ (\D\D _ \d{6} (\d*)) [\n\s\t]+ /x
? $accession = $1
: die "Cant find ACCESSION line" ;
$sequence =~ s/[\d\n\s\t\/]//g;
push @organisms, [$organism , $accession , $sequence] ;
}
my @sorted_organisms = sort { length $a->[2] <=> length $b->[2] } @organisms ;
my ($organism , $accession , $sequence) = @{ $sorted_organisms[0] };
say "this is the shortest: $accession, $organism, size: ",
length $sequence, "\n", " sequence: ", $sequence ;
($organism , $accession , $sequence) = @{ $sorted_organisms[-1] };
say "this is the longest: $accession, $organism, size: ",
length $sequence, "\n", " sequence: ", $sequence ;
exit;