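I have the following script (written by a colleague) that searches an input text (a DNA sequence) for a specific substring and essentially outputs the number of letters between each occurrence of that substring: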
#!/usr/bin/perl
#read file in from input line
$infile = $ARGV[0];
open(TXT, "<$infile");
#open output stream
$outfile = $ARGV[1];
open(OUT, ">$outfile");
#initialize a blank string for the DNA sequence
$DNA = &read_fasta();
$len = length($DNA);
print "\n DNA Length is: $len \n";
#restriction enzyme match pattern
$pattern = "AGCT";
$match = 0;
while($DNA =~ /$pattern/gi)
{
    $match++;
}
print "\n Total DNA matches to AGCT are: $match \n";
# split the DNA sequence into an array of fragments
@cutarr = split(/$pattern/i, $DNA);
#write the fragments out to a file
foreach $str(@cutarr)
{
    $len = length($str);
    print OUT "$len \n";
}
# Subfunction to read in a fasta file
sub read_fasta
{
    $sequence = "";
    while(<TXT>)
    {
        $line = $_;
        #remove newline characters
        chomp($line);
        # discard fasta header line
        if($line =~ /^>/){ next }
        # append the line to the DNA sequence
        else { $sequence .= $line }
    }
    return($sequence);
}
print "DNA is: \n $sequence \n";
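I was wondering whether someone could help me add a second search pattern, so that the script outputs the number of characters between any hits from the two searches, i.e. with $pattern1 = AGCT and $pattern2 = GATC and the input sequence:
GGGGCC-AGCT-GAGAGACC-GATC-GAGAGAGAG-AGCT-
(The dashes are only there to show where the search hits are.)
The output would consist of: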
6
8
9
Thanks!
Answer 0 (score: 0)
You can try the following script:
use v5.12;
use autodie;
open(my $in, "<", shift);
open(my $out, ">", shift);
my $DNA = read_fasta($in);
print "DNA is: \n $$DNA \n";
my $len = length($$DNA);
print "\n DNA Length is: $len \n";
my @pats=qw( AGCT GATC );
for (@pats) {
    my $m = () = $$DNA =~ /$_/gi;
    print "\n Total DNA matches to $_ are: $m \n";
}
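# combine the patterns into one alternation and split on any of them;
# /i keeps the split case-insensitive, consistent with the match counts above
my $pat=join("|",@pats);
my @cutarr = split(/$pat/i, $$DNA);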
#write the fragments out to a file
for (@cutarr) {
    my $len = length($_);
    print $out "$len \n";
}
close($out);
close($in);
# Subfunction to read in a fasta file
sub read_fasta {
    my ($in) = @_;
    my $sequence = "";
    while(<$in>) {
        my $line = $_;
        #remove newline characters
        chomp($line);
        # discard fasta header line
        if($line =~ /^>/){ next }
        # append the line to the DNA sequence
        else { $sequence .= $line }
    }
    return(\$sequence);
}
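
To check the approach without a FASTA file, here is a minimal sketch that applies the same idea (join the patterns into one alternation, then split on it) to the example sequence from the question. The helper name fragment_lengths is hypothetical and not part of the script above, and the use of quotemeta is an added precaution rather than something the original needs for plain A/C/G/T patterns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the script above): return the lengths of the
# fragments left after cutting $dna at every occurrence of any pattern in
# @$patterns, ignoring case.
sub fragment_lengths {
    my ($dna, $patterns) = @_;
    # quotemeta keeps each pattern literal; it changes nothing for plain
    # A/C/G/T strings but guards against regex metacharacters.
    my $alternation = join '|', map { quotemeta($_) } @$patterns;
    return map { length($_) } split /$alternation/i, $dna;
}

# Example sequence from the question, with the illustrative dashes removed.
my $dna = 'GGGGCCAGCTGAGAGACCGATCGAGAGAGAGAGCT';
my @lengths = fragment_lengths($dna, [qw(AGCT GATC)]);
print "$_\n" for @lengths;    # prints 6, 8 and 9, one per line
```

Note that split drops trailing empty fields, so a sequence that ends in a hit (as this one does) produces no extra 0 at the end, whereas a sequence that begins with a hit yields a leading 0.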