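I'm a student in a Perl course, and I'm looking for advice on how to approach an assignment. My professor encourages using forums. The assignment is:
Write a Perl program that takes two files from the command line, an enzyme file and a DNA file. Read the file with the restriction enzymes and apply the restriction enzymes to the DNA file.
The output will be the DNA fragments, arranged in the order in which they occur in the DNA file. The name of the output file should be constructed by appending the name of the restriction enzyme to the name of the DNA file, with an underscore between them.
For example, if the enzyme is EcoRI and the DNA file is named BC161026, the output file should be named BC161026_EcoRI.
My approach is to create a main program and two subroutines, as follows:
Main: not sure how to tie my subs together?
Subroutine $DNA: take the DNA file and remove any newlines to make a single string.
Subroutine Enzymes: read and store the lines from the enzyme file given on the command line. Parse the file so that each enzyme's acronym is separated from its cut location. Store the cut location as a regular expression in a hash table. Store the acronym in the hash table.
Note on the enzyme file format: the enzyme file follows a format called Staden. Examples: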
AatI/AGG'CCT//
AatII/GACGT'C//
AbsI/CC'TCGAGG//
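The enzyme acronym consists of the characters before the first slash (AatI, in the first example). The recognition sequence is everything between the first and second slashes (AGG'CCT, again in the first example). The cut point is denoted by an apostrophe in the recognition sequence. The standard abbreviations for DNA within enzymes are as follows:
R = G or A. B = not A (C or G or T). Etc.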
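The abbreviations listed above are the IUPAC nucleotide ambiguity codes. As a minimal sketch of how they might be turned into regex character classes, the following maps each code to a class and translates one half of a recognition sequence. The helper name site_to_regex and the %IUPAC table are illustrative assumptions, not part of the assignment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# IUPAC nucleotide codes mapped to regex character classes.
my %IUPAC = (
    A => 'A',     C => 'C',     G => 'G',     T => 'T',
    R => '[AG]',  Y => '[CT]',  S => '[CG]',  W => '[AT]',
    K => '[GT]',  M => '[AC]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

# site_to_regex (hypothetical helper): translate a run of base codes,
# without the apostrophe, into a regex fragment.
sub site_to_regex {
    my ($site) = @_;
    my $regex = '';
    for my $code (split //, uc $site) {
        die "Unknown base code '$code' in '$site'\n"
            unless exists $IUPAC{$code};
        $regex .= $IUPAC{$code};
    }
    return $regex;
}

print site_to_regex('RRR'), "\n";    # prints [AG][AG][AG]
print site_to_regex('TTT'), "\n";    # prints TTT
```

Keeping the table in one hash means a new code only needs one new entry, and an unexpected character in the enzyme file fails loudly instead of silently matching nothing.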
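Besides recommending the main chunks, do you see anything I have missed? Can you recommend any specific tools that you think would be useful in patching this program together?
Example input enzyme: TryII/RRR'TTT//
Example string to read: CCCCCCGGGTTTCCCCCCCCCCCCAAATTTCCCCCCCCCCCCAGATTTCCCCCCCCCCGAGTTTCCCCC
The output should be: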
CCCCCCGGG
TTTCCCCCCCCCCCCAAA
TTTCCCCCCCCCCCCAGA
TTTCCCCCCCCCCGAG
TTTCCCCC
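Answer 0 (score: 3)
Note that when you store the enzymes in a hash, the enzyme name should be the key and the site should be the value (since in principle two enzymes could have the same site).
In the Main routine you can iterate over the hash; each enzyme produces one output file. The most direct approach is to convert the site to a regex (by means of other regexes) and apply it to the DNA sequence, but there are other ways. (It would probably be worth splitting this off into at least one other sub.)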
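One possible way to sketch this idea: keep a hash of enzyme name to site, convert each site to a regex, and cut the DNA with a zero-width split. The digest helper, the %CLASS table (which only covers the two codes quoted in the question), and the hard-coded sample data are assumptions made for illustration; a real program would read both from the files and write each result to its own output file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Only the two ambiguity codes quoted in the question; a full table
# would list the remaining codes as well.
my %CLASS = ( R => '[GA]', B => '[CGT]' );

# digest (hypothetical helper): takes a Staden-style site such as
# "RRR'TTT" and a DNA string, and returns the fragments in order.
sub digest {
    my ($site, $dna) = @_;
    my ($pre, $post) = split /'/, $site, 2;
    die "No cut point (') in site '$site'\n" unless defined $post;

    # Expand the ambiguity codes into character classes.
    for my $part ($pre, $post) {
        $part =~ s/([RB])/$CLASS{$1}/g;
    }

    # Zero-width split: cut wherever $pre has just ended and $post begins,
    # so no bases are consumed by the match.
    return split /(?<=$pre)(?=$post)/, $dna;
}

my %enzymes = ( TryII => "RRR'TTT" );    # enzyme name => site
my $dna = 'CCCCCCGGGTTTCCCCCCCCCCCCAAATTTCCCCCCCCCCCCAGATTTCCCCCCCCCCGAGTTTCCCCC';

for my $name (sort keys %enzymes) {
    print "$name:\n";
    print "  $_\n" for digest($enzymes{$name}, $dna);
}
```

For the TryII example this should yield the five fragments listed in the question. The lookbehind works here because every ambiguity code expands to a single-character class, so the pattern before the cut has a fixed length.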
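Answer 1 (score: 3)
Here is how I would try to solve the problem (code below).
1) Pick up the file names from the arguments and create the corresponding filehandles.
2) Create a new filehandle for the output file, named in the specified format.
3) Extract the "cut point" from the first file.
4) Loop over the DNA sequence in the second file, splitting it at the cut point obtained in step 3.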
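#!/usr/bin/perl
use strict;
use warnings;

my $file_enzyme = $ARGV[0];
my $file_dna    = $ARGV[1];

# The output file is named <DNA file>_<enzyme file>
open(my $dnaseq, '>', "${file_dna}_${file_enzyme}") or die "Cannot open output file: $!";
open(my $enzyme, '<', $file_enzyme) or die "Cannot open $file_enzyme: $!";
open(my $dna,    '<', $file_dna)    or die "Cannot open $file_dna: $!";

while (<$enzyme>) {
    chomp;
    # Extracts the cut point between ' & // in the enzyme file.
    # Only the part after the apostrophe is used, and ambiguity codes
    # such as R are not expanded.
    if (/'(.*)\/\//) {
        my $pattern = $1;
        while (<$dna>) {
            chomp;
            #print $pattern;
            my @output = split /$pattern/, $_;
            print {$dnaseq} shift(@output), "\n";    # first recognized sequence being output
            foreach (@output) {
                print {$dnaseq} "$pattern$_\n";      # prefixing the remaining sequences with the cut point pattern
            }
        }
    }
}

close $dna;
close $enzyme;
close $dnaseq;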
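Answer 2 (score: 3)
OK, I know I shouldn't just do your homework, but this one had some interesting tricks, so I played with it. Learn from it, don't just copy it. I haven't commented it very well, so if there is anything you don't understand, please ask. It contains some slight magic, and if you haven't covered that in class your professor will know, so be sure you understand it.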
#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long;

my ($enzyme_file, $dna_file);
my $write_output = 0;
my $verbose      = 0;
my $help         = 0;

GetOptions(
    'enzyme=s' => \$enzyme_file,
    'dna=s'    => \$dna_file,
    'output'   => \$write_output,
    'verbose'  => \$verbose,
    'help'     => \$help
);

$help = 1 unless ($dna_file && $enzyme_file);
help() if $help;    # exits

# 'Main'
my $dna = getDNA($dna_file);
# A function returns a flat list rather than a hash, so getEnzymes returns
# a hashref, which is dereferenced and stored as a hash here
my %enzymes = %{ getEnzymes($enzyme_file) };

foreach my $enzyme (keys %enzymes) {
    print "Applying enzyme " . $enzyme . " gives:\n";
    my $dna_holder = $dna;
    my ($precut, $postcut) = ($enzymes{$enzyme}{'precut'}, $enzymes{$enzyme}{'postcut'});

    # Expand the ambiguity codes into character classes
    my $R = qr/[GA]/;
    my $B = qr/[CGT]/;
    $precut  =~ s/R/${R}/g;
    $precut  =~ s/B/${B}/g;
    $postcut =~ s/R/${R}/g;
    $postcut =~ s/B/${B}/g;
    print "\tPre-Cut pattern: " . $precut . "\n" if $verbose;
    print "\tPost-Cut pattern: " . $postcut . "\n" if $verbose;

    #while(1){
    #    if ($dna_holder =~ s/(.*${precut})(${postcut}.*)/$1/ ) {
    #        print "\tFound section:" . $2 . "\n" if $verbose;
    #        print "\tRemaining DNA: " . $1 . "\n" if $verbose;
    #        unshift @{ $enzymes{$enzyme}{'cut_dna'} }, $2;
    #    } else {
    #        unshift @{ $enzymes{$enzyme}{'cut_dna'} }, $dna_holder;
    #        print "\tNo more cuts.\n" if $verbose;
    #        print "\t" . $_ . "\n" for @{ $enzymes{$enzyme}{'cut_dna'} };
    #        last;
    #    }
    #}

    # Insert a temporary cut marker (') at every site, then split on it
    unless ($dna_holder =~ s/(${precut})(${postcut})/${1}'${2}/g) {
        print "\tHas no effect on given strand\n" if $verbose;
    }
    @{ $enzymes{$enzyme}{'cut_dna'} } = split(/'/, $dna_holder);
    print "\t$_\n" for @{ $enzymes{$enzyme}{'cut_dna'} };

    # Note that $enzymes{$enzyme}{'cut_dna'} is an arrayref already
    writeOutput($dna_file, $enzyme, $enzymes{$enzyme}{'cut_dna'}) if $write_output;
    print "\n";
}

# Read the DNA file and join all of its lines into a single string
sub getDNA {
    my ($dna_file) = @_;

    open(my $dna_handle, '<', $dna_file) or die "Cannot open file $dna_file: $!";
    my @dna_array = <$dna_handle>;
    chomp(@dna_array);
    my $dna = join('', @dna_array);

    print "Using DNA:\n" . $dna . "\n\n" if $verbose;
    return $dna;
}

# Parse the enzyme file into { name => { precut, postcut, cut_dna } }
sub getEnzymes {
    my ($enzyme_file) = @_;
    my %enzymes;

    open(my $enzyme_handle, '<', $enzyme_file) or die "Cannot open file $enzyme_file: $!";
    while (<$enzyme_handle>) {
        chomp;
        if (m{([^/]*)/([^']*)'([^/]*)//}) {
            print "Found Enzyme " . $1 . ":\n\tPre-cut: " . $2 . "\n\tPost-cut: " . $3 . "\n" if $verbose;
            $enzymes{$1} = {
                precut  => $2,
                postcut => $3,
                cut_dna => []    # Added to show the empty array that will hold the cut DNA sections
            };
        }
    }

    print "\n" if $verbose;
    return \%enzymes;
}

# Write the fragments, one per line, to <DNA file>_<enzyme name>
sub writeOutput {
    my ($dna_file, $enzyme, $cut_dna_ref) = @_;

    my $outfile = $dna_file . '_' . $enzyme;
    print "\tSaving data to $outfile\n" if $verbose;

    open(my $outfile_handle, '>', $outfile) or die "Cannot open $outfile for writing: $!";
    print $outfile_handle $_ . "\n" for @{ $cut_dna_ref };
}

sub help {
    my $filename = (split('/', $0))[-1];

    my $enzyme_text = <<'END';
AatI/AGG'CCT//
AatII/GACGT'C//
AbsI/CC'TCGAGG//
TryII/RRR'TTT//
Test/AAA'TTT//
END

    my $dna_text = <<'END';
CCCCCCGGGTTTCCCCCCC
CCCCCAAATTTCCCCCCCCCCCCAGATTTC
CCCCCCCCCGAGTTTCCCCC
END

    print <<END;
Usage:
    $filename --enzyme (-e) <enzyme-filename> --dna (-d) <dna-filename> [options] (files may come in either order)
    $filename -h (shows this help)
Options:
    --verbose (-v) print additional (debugging) information
    --output (-o) output final data to files
Files:
The DNA file contains one DNA string which may be broken over many lines. E.g.:
$dna_text
The enzymes file contains enzyme definitions, one per line. E.g.:
$enzyme_text
END

    exit;
}
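Edit: Added the cut_dna initialization explicitly. Since it is the holder for each enzyme's final result, I thought it would be good to see it more clearly.
Edit 2: Added the output routine, its call, the flag, and the corresponding help text.
Edit 3: Changed the main routine to incorporate the best of canavanin's approach, removing the loop. It is now a global substitution that adds a temporary cut marker ('), followed by a split on the cut marker into an array. The old approach is left in as a comment; the new approach is the 5 lines after it.
Edit 4: An additional test case for writing to multiple files (below).
my @names = ('cat', 'dog', 'sheep');
foreach my $name (@names) {    # $name is lexical, i.e. it dies after each loop iteration
    open(my $handle, '>', $name) or die "Cannot open $name: $!";    # open a lexical handle for the file, which also dies each iteration
    print $handle $name;       # write to the handle
    # $handle closes automatically when it "goes out of scope"
}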
Answer 3 (score: 2)
I know there are already several answers, but hey... I just wanted to try my luck, so here is my suggestion:
#!/usr/bin/perl
use warnings;
use strict;
use Getopt::Long;

my ($enz_file, $dna_file);
GetOptions(
    "e=s" => \$enz_file,
    "d=s" => \$dna_file,
);

if (! $enz_file || ! $dna_file) {
    # some help text
    print STDERR <<EOF;
Usage: restriction.pl -e enzyme_file -d DNA_file
The enzyme_file should contain one enzyme entry per line.
The DNA_file may contain the sequence on one single or on
several lines; all lines will be concatenated to yield a
single string.
EOF
    exit();
}

my %enz_and_patterns;    # stores enzyme name and corresponding pattern
open(my $enz_fh, '<', $enz_file) or die "Could not open file $enz_file: $!";
while (<$enz_fh>) {
    if (m#^(\w+)/([\w']+)//$#) {
        my $enzyme  = $1;    # could also remove those two lines and use
        my $pattern = $2;    # the match variables directly, but this is clearer
        $enz_and_patterns{$enzyme} = $pattern;
    }
}
close $enz_fh;

my $dna_sequence;
open(my $dna_fh, '<', $dna_file) or die "Could not open file $dna_file: $!";
while (my $line = <$dna_fh>) {
    chomp $line;
    $dna_sequence .= $line;    # append the current bit to the sequence
                               # that has been read so far
}
close $dna_fh;

foreach my $enzyme (keys %enz_and_patterns) {
    my $dna_seq_processed = $dna_sequence;    # local copy so that we retain the original

    # now translate the restriction pattern to a regular expression pattern:
    my $pattern = $enz_and_patterns{$enzyme};
    $pattern =~ s/R/[GA]/g;    # use character classes
    $pattern =~ s/B/[^A]/g;
    $pattern =~ s/(.+)'(.+)/($1)($2)/;    # remove the ', but due to the grouping
                                          # we "remember" its position

    $dna_seq_processed =~ s/$pattern/$1\n$2/g;    # in effect we are simply replacing
                                                  # each ' with a newline character

    my $outfile = "${dna_file}_$enzyme";
    open(my $out_fh, '>', $outfile) or die "Could not open file $outfile: $!";
    print {$out_fh} $dna_seq_processed, "\n";
    close $out_fh;
}
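I have tested my code with your TryII example, and it worked fine.
Since this is a task that can be handled by writing a few lines of non-repetitive code, I did not feel that creating separate subroutines was justified. I hope I will be forgiven... :)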