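I am trying to extract DNA sequences from a text file and store them. I can do it with the code below, but it is not the best approach because I read the text file line by line. Is there a simpler way to find every DNA sequence in my text file without reading it line by line?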
example.pl
#!/usr/local/bin/perl
open(MYFILE, 'data.txt');
@entire_file = <MYFILE>;
while (<MYFILE>) {
chomp;
print "$_\n";
}
$line1 = <MYFILE>;
chomp $line1;
$line2 = <MYFILE>;
chomp $line2;
$line3 = <MYFILE>;
chomp $line3;
$line4 = <MYFILE>;
chomp $line4;
$line5 = <MYFILE>;
chomp $line5;
#Prints DNA sequence 1
print "$line2";
#Prints DNA sequence 2
print "$line5";
close(MYFILE);
data.txt
gi|171361, Saccharomyces cerevisiae, (CYS3) gene, Lab 1, Joe Bloggs
GCAGCGATCGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC

gi|171362, Saccharomyces cerevisiae, (CYS4) gene, Lab 2, Paul McDonald
GAAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC
Answer 0 (score: 3)
Here is an example using the BioPerl module Bio::SeqIO.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $in = Bio::SeqIO->new( -file => "junk.txt" ,
-format => 'FASTA');
while ( my $seq = $in->next_seq() ) {
printf "id: %s\ndescr: %s\nseq: %s\n\n", $seq->id, $seq->desc, $seq->seq;
}
__END__
Contents of junk.txt
>gi|171361, Saccharomyces cerevisiae, (CYS3) gene, Lab 1, Joe Bloggs
GCAGCGATCGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCG
CTTGCGAAAGCATCGAGTACC
>gi|171362, Saccharomyces cerevisiae, (CYS4) gene, Lab 2, Paul McDonald
GAAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCG
CTTGCGAAAGCATCGAGTACC
And here is the result of running the program.
C:\Old_Data\perlp>perl t5.pl
id: gi|171361,
descr: Saccharomyces cerevisiae, (CYS3) gene, Lab 1, Joe Bloggs
seq: GCAGCGATCGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC
id: gi|171362,
descr: Saccharomyces cerevisiae, (CYS4) gene, Lab 2, Paul McDonald
seq: GAAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC
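Answer 1 (score: 1)
After
@entire_file = <MYFILE>;
the whole file is stored in the array @entire_file. Every later use of the readline operator (<...>) on MYFILE returns nothing, because the file has already been read to the end.
You can loop over the elements of the array and do whatever you need with them, for example: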
foreach my $line (@entire_file) {
if ($line =~ /^gi/) { print "Descriptor: $line" }
else { print "Sequence: $line" }
}
I suggest you read up on reading files, pattern matching, and loops in general.
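To make the point above concrete, here is a minimal sketch that reads the input only once and keeps the sequences in an array. It assumes, like the loop above, that descriptor lines start with gi and that every other non-blank line is a sequence. The sub name read_sequences and the in-memory sample are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect every non-blank line that is not a descriptor.
# Assumption: descriptor lines start with "gi" (optionally preceded by ">").
sub read_sequences {
    my ($fh) = @_;
    my @sequences;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*$/;    # skip blank separator lines
        next if $line =~ /^>?gi/;    # skip descriptor lines
        push @sequences, $line;
    }
    return @sequences;
}

# Sample input held in a string so the sketch runs without data.txt.
my $sample = <<'END';
gi|171361, Saccharomyces cerevisiae, (CYS3) gene, Lab 1, Joe Bloggs
GCAGCGATCGACAGCTGTGC

gi|171362, Saccharomyces cerevisiae, (CYS4) gene, Lab 2, Paul McDonald
GAAGCGCACGACAGCTGTGC
END

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my @sequences = read_sequences($fh);
close $fh;

print "Sequence $_: $sequences[ $_ - 1 ]\n" for 1 .. @sequences;
```

To run it against a real file, open data.txt with the same three-argument open (open my $fh, '<', 'data.txt' or die ...) and pass that handle to read_sequences instead.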
Answer 2 (score: 1)
If you have all of the file's lines in an array, you can iterate over that array and use a regex to get the id/descriptor and sequence elements:
use Modern::Perl;
use Data::Dumper;
my ( @id, @des, @dna );
chomp( my @FASTA = <DATA> );
# Step by 3: each record is a header line, a sequence line, and a blank separator line
for ( my $i = 0 ; $i < @FASTA ; $i += 3 ) {
my ( $id, $des ) = split ', ', $FASTA[$i], 2;
push @id, $id;
push @des, $des;
push @dna, $FASTA[ $i + 1 ];
}
say Dumper \@id, \@des, \@dna;
say @FASTA + 0;
__DATA__
>gi|171361, Saccharomyces cerevisiae, (CYS3) gene, Lab 1, Joe Bloggs
GCAGCGATCGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC

>gi|171362, Saccharomyces cerevisiae, (CYS4) gene, Lab 2, Paul McDonald
GAAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC
Output:
$VAR1 = [
'>gi|171361',
'>gi|171362'
];
$VAR2 = [
'Saccharomyces cerevisiae, (CYS3) gene, Lab 1, Joe Bloggs',
'Saccharomyces cerevisiae, (CYS4) gene, Lab 2, Paul McDonald'
];
$VAR3 = [
'GCAGCGATCGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC',
'GAAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC'
];
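The fixed step in the loop above depends on every record taking the same number of lines. As a possible alternative, here is a sketch that sets the input record separator $/ so that each read returns one whole record, which also copes with sequences wrapped over several lines. It assumes FASTA-style input in which every header starts with > and the id is separated from the description by a comma. The sub name parse_fasta and the hash keys are made up for illustration, and unlike the code above it strips the leading > from the id.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse FASTA-style text into a list of records.
# Each record is a hash reference with id, desc and seq keys.
sub parse_fasta {
    my ($fh) = @_;
    my @records;
    local $/ = "\n>";    # each <$fh> now returns one whole record
    while ( my $chunk = <$fh> ) {
        chomp $chunk;            # removes the trailing "\n>" when present
        $chunk =~ s/^>//;        # the very first record still has its ">"
        my ( $header, @seq_lines ) = split /\n/, $chunk;
        next unless defined $header && length $header;
        my ( $id, $desc ) = split /,\s*/, $header, 2;
        $desc = '' unless defined $desc;
        my $seq = join '', @seq_lines;
        $seq =~ s/\s+//g;        # drop stray whitespace inside the sequence
        push @records, { id => $id, desc => $desc, seq => $seq };
    }
    return @records;
}

# Sample input held in a string so the sketch runs on its own.
my $sample = <<'END';
>gi|171361, Saccharomyces cerevisiae, (CYS3) gene, Lab 1, Joe Bloggs
GCAGCGATCGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCG
CTTGCGAAAGCATCGAGTACC

>gi|171362, Saccharomyces cerevisiae, (CYS4) gene, Lab 2, Paul McDonald
GAAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCG
CTTGCGAAAGCATCGAGTACC
END

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my @records = parse_fasta($fh);
close $fh;

printf "id: %s\ndescr: %s\nseq: %s\n\n", @{$_}{qw(id desc seq)} for @records;
```

Because $/ is changed with local inside the sub, the rest of the program keeps reading line by line as usual.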
Answer 3 (score: 0)
If you just want the sequences on the command line, this one-liner will do it:
perl -lane 'print $F[-1] if @F' data.txt
See perlrun(1) for details.
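For readers unfamiliar with the switches, the one-liner above can be spelled out roughly as the sketch below: -n loops over the input lines, -l strips the line ending (and makes print add one), and -a splits each line on whitespace into @F. The sub name last_fields and the in-memory sample are made up for illustration. Note that it keeps the last field of every non-blank line, so it returns only sequences when the sequence is the last whitespace-separated field of each such line. If descriptors sit on lines of their own, their last word is returned as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper mirroring: perl -lane 'print $F[-1] if @F'
sub last_fields {
    my ($fh) = @_;
    my @out;
    while ( my $line = <$fh> ) {     # -n: loop over every input line
        chomp $line;                 # -l: strip the line ending
        my @F = split ' ', $line;    # -a: split on whitespace into @F
        push @out, $F[-1] if @F;     # keep the last field of non-blank lines
    }
    return @out;
}

# Sample input with descriptor and sequence on the same line (an assumption).
my $sample = <<'END';
gi|171361, Saccharomyces cerevisiae, (CYS3) gene, Lab 1, Joe Bloggs GCAGCGATCGACAGCTGTGC

gi|171362, Saccharomyces cerevisiae, (CYS4) gene, Lab 2, Paul McDonald GAAGCGCACGACAGCTGTGC
END

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
print "$_\n" for last_fields($fh);
close $fh;
```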
Using awk:
awk 'NF { print $NF }' data.txt