我是perl的新手。还在学习。
我有一个fasta格式的文件。我想提取跨越特定位置的序列。例如,从位置200到300
>Contig[0001]
TGCATCAAAAGCTGAAAATATGTAGTCGAGAAGTCATTTCGAGAAATTGACGTTTTAAGT
TTCGGTTTCCAAATTCAACCGGATGTATCTTCGCCAATAATTGTCAGCAGTTAGAATTTC
TTTCAACATTATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATG
TAGTCTTGAAGTCATTTCGAGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGG
ATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT
TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA
AATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGT
CAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTACATATTTTGACCCTGCATC
AAAAGCTGAAAATATGTAGTCTCGAAGTCATTTTGAGAAGTTAGAATTTCTTTCAACATT
ATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAA
GTCWTTTCRAGAAATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTC
GCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTATATATTT
TGACTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAAGTCATTTCGAGAAATTGACGTT
我想从序列Contig[0001]
中提取位置200-300的序列。输出将是:
>Contig[0001]_200-300
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATT
GTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT
我的fasta文件中有近500个序列,并且我在包含id start end的制表符分隔文件中有所需的位置。
如果有人能帮助我,那就太好了。
非常感谢你的帮助。我不确定我能否提供包含有关职位信息的文件。
新手
答案 0 :(得分:3)
一种方式。内容script.pl
:
#!/usr/bin/env perl
use warnings;
use strict;
my ($adn, $l, $header);
while ( <> ) {
chomp;
## First line is known, a header, so print it and process next one.
if ( $. == 1 ) {
printf qq|%s_%s\n|, $_, q|200-300|;
next;
}
## Concat adn while not found a header.
if ( '>' ne substr $_, 0, 1 ) {
if ( ! $l ) { $l = length }
$adn .= $_;
if ( ! eof ) { next }
}
else {
$header = sprintf qq|%s_%s\n|, $_, q|200-300|;
}
## Extract range 200-300 and insert newlines to set same length of
## line than before.
my $s = substr $adn, 199, 100;
$s =~ s/(.{$l})/$1\n/g;
printf qq|%s\n|, $s;
undef $adn;
## If not end of file, print the header of the following adn.
if ( ! eof ) { print $header }
}
像以下一样运行:
perl script.pl infile
产量:
>Contig[0001]_200-300
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAAT
TGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTT
答案 1 :(得分:2)
我假设您知道如何将其纳入您的计划。看看substr
功能。它做你想做的事。
substr EXPR,OFFSET,LENGTH
基本上你需要做的是:
my $snippet = substr($sequence, 200, 100);
再想一想,看看CPAN:有一个名为Bio::SeqReader::Fasta的模块,您可以使用它来读取文件并获取序列。在我看来,这是非常好的记录,我对此很感兴趣。所以这是一个解决方案。
use strict;
use warnings;
use feature qw(say);
use Bio::SeqReader::Fasta;
# Put your positions here (start, end)
my @positions = (
[ 200, 300 ],
[ 50, 180 ],
);
open my $fh, '<', '/path/to/file.fasta' or die $!;
my $in = new Bio::SeqReader::Fasta( fh => $fh ); # create the B::SR::F object
my $i = 0;
while ( my $so = $in->next() ) {
# $so represents one dataset and is a Bio::SeqReader::FastaRecord
say substr($so->seq(), # we want a part of the sequence string
$positions[$i]->[0], # starting position
$positions[$i]->[1] - $positions[$i]->[0]); # end - start --> length
}
close $fh;
答案 2 :(得分:2)
使用Perl和Bio :: SeqIO模块的一种方法。像:
一样运行./process_fasta.pl file.fa 200 300
./process_fasta.pl
的内容:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $in_file = $ARGV[0];
my $start_pos = $ARGV[1];
my $end_pos = $ARGV[2];
my $in = Bio::SeqIO->new ( -file => $in_file, -format => 'fasta');
my $out = Bio::SeqIO->new( -file => ">$in_file.out", -format => 'fasta');
while (my $seq = $in->next_seq() ) {
$seq->display_id( $seq->display_id() . "_$start_pos-$end_pos" );
$out->write_seq( $seq->trunc($start_pos, $end_pos) );
}
结果:
>Contig[0001]_200-300
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAAT
TGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT
答案 3 :(得分:1)
我赞赏那些使用Bio::
模块的人,因为我更喜欢写新的东西。不过,也许以下内容会有所帮助:
use strict;
use warnings;
my $end = pop;
my $start = pop;
local $/ = '>';
while (<>) {
chomp;
next unless /(Contig.+)/;
my ($header) = "$/$1_$start-$end\n";
my $seq = ${^POSTMATCH};
$seq =~ s/\s//g;
print $header;
print +( substr $seq, $start - 1, $end - $start ) . "\n";
}
用法:perl script.pl inFile start end [>outFile]
最后一个可选参数将输出定向到文件。
示例:perl script.pl data.fasta 200 300
数据集输出:
>Contig[0001]_200-300
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTT
开始和结束参数为pop
ped off @ARGV
,然后记录分隔符设置为“&gt;”。当文件被读取时 - 一次是一个fasta记录 - 捕获标题信息,将序列保留在${^POSTMATCH}
中。从序列中删除所有空白区域。最后,重新格式化的标题为print
ed,序列中的字符范围也是如此。
希望这有帮助!
答案 4 :(得分:1)
完成工作轻量级脚本:
#!/usr/bin/env perl
use strict;
use warnings;
my $first=1;
if ( $ARGV[0] eq '-0' )
{
shift @ARGV;
$first=0;
}
die("Usage: cat <coords> | get_subseqs.pl (-0) <fasta files> > out; where coords is id, start, end. Use -0 if coords start with 0 instead of 1.\n") unless @ARGV;
# read coords
my %contigs = (); # id => [ start, end ]
while (<STDIN>) {
chomp;
my @row = split(/\t/);
die("Require tab-delim: id, start, end\n") unless @row == 3;
$contigs{$row[0]} = [ $row[1], $row[2] ];
}
# get subseqs
my ($id,$seq,$start,$end);
foreach my $infile (@ARGV) {
open(IN, '<', $infile) or die($!);
while (<IN>) {
if (/>(\S+)/) {
my $id2 = $1;
print ">${id}_$start-$end\n", reformat(substr($seq, $start-$first, $end-$start+1)) if $id;
if ( exists($contigs{$id2}) ) {
($id,$seq,$start,$end) = ($id2,'',@{$contigs{$id2}});
delete($contigs{$id2});
} else { $id=undef }
} elsif($id) { $seq .= $_ }
}
close(IN);
print ">${id}_$start-$end\n", reformat(substr($seq, $start-$first, $end-$start+1)) if $id;
$id = undef;
}
sub reformat { # add newline every 60 bases
my $seq = shift;
my $seq2 = '';
while ( length($seq) > 60 ) {
$seq2 .= substr($seq,0,60)."\n";
$seq = substr($seq,60);
}
$seq2 .= $seq."\n";
return $seq2;
}