基于位置提取fasta序列

时间:2013-05-14 21:16:22

标签: perl position sequence extraction fasta

我是perl的新手。还在学习。

我有一个fasta格式的文件。我想提取跨越特定位置的序列。例如,从位置200到300

>Contig[0001]
TGCATCAAAAGCTGAAAATATGTAGTCGAGAAGTCATTTCGAGAAATTGACGTTTTAAGT
TTCGGTTTCCAAATTCAACCGGATGTATCTTCGCCAATAATTGTCAGCAGTTAGAATTTC
TTTCAACATTATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATG
TAGTCTTGAAGTCATTTCGAGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGG
ATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT
TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA
AATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGT
CAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTACATATTTTGACCCTGCATC
AAAAGCTGAAAATATGTAGTCTCGAAGTCATTTTGAGAAGTTAGAATTTCTTTCAACATT
ATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAA
GTCWTTTCRAGAAATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTC
GCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTATATATTT
TGACTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAAGTCATTTCGAGAAATTGACGTT

我想从序列Contig[0001]中提取位置200-300的序列。输出将是:

>Contig[0001]_200-300
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATT
GTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT

我的fasta文件中有近500个序列,并且我在包含id start end的制表符分隔文件中有所需的位置。

如果有人能帮助我,那就太好了。

非常感谢你的帮助。我不确定我能否提供包含有关职位信息的文件。

新手

5 个答案:

答案 0 :(得分:3)

一种方式。内容script.pl

#!/usr/bin/env perl

use warnings;
use strict;

my ($adn, $l, $header);

while ( <> ) { 
    chomp;

    ## First line is known, a header, so print it and process next one.
    if ( $. == 1 ) { 
        printf qq|%s_%s\n|, $_, q|200-300|;
        next;
    }   

    ## Concat adn while not found a header.
    if ( '>' ne substr $_, 0, 1 ) { 
        if ( ! $l ) { $l = length }
        $adn .= $_; 
        if ( ! eof ) { next }
    }   
    else {
        $header = sprintf qq|%s_%s\n|, $_, q|200-300|;
    }   

    ## Extract range 200-300 and insert newlines to set same length of 
    ## line than before.
    my $s = substr $adn, 199, 100;
    $s =~ s/(.{$l})/$1\n/g;
    printf qq|%s\n|, $s; 
    undef $adn;

    ## If not end of file, print the header of the following adn.
    if ( ! eof ) { print $header }
}

像以下一样运行:

perl script.pl infile

产量:

>Contig[0001]_200-300
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAAT
TGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTT

答案 1 :(得分:2)

我假设您知道如何将其纳入您的计划。看看substr功能。它做你想做的事。

substr EXPR,OFFSET,LENGTH 

基本上你需要做的是:

my $snippet = substr($sequence, 200, 100);

再想一想,看看CPAN:有一个名为Bio::SeqReader::Fasta的模块,您可以使用它来读取文件并获取序列。在我看来,这是非常好的记录,我对此很感兴趣。所以这是一个解决方案。

use strict;
use warnings;
use feature qw(say);
use Bio::SeqReader::Fasta;

# Put your positions here (start, end)
my @positions = (
  [ 200, 300 ], 
  [ 50, 180 ],
);

open my $fh, '<', '/path/to/file.fasta' or die $!;
my $in = new Bio::SeqReader::Fasta( fh => $fh ); # create the B::SR::F object

my $i = 0;
while ( my $so = $in->next() ) {
  # $so represents one dataset and is a Bio::SeqReader::FastaRecord

  say substr($so->seq(),               # we want a part of the sequence string
    $positions[$i]->[0],               # starting position
    $positions[$i]->[1] - $positions[$i]->[0]); # end - start --> length
}

close $fh;

答案 2 :(得分:2)

使用Perl和Bio :: SeqIO模块的一种方法。像:

一样运行
./process_fasta.pl file.fa 200 300

./process_fasta.pl的内容:

#!/usr/bin/perl 

use strict;
use warnings;
use Bio::SeqIO;

my $in_file = $ARGV[0];
my $start_pos = $ARGV[1];
my $end_pos = $ARGV[2];

my $in = Bio::SeqIO->new ( -file => $in_file, -format => 'fasta');
my $out = Bio::SeqIO->new( -file => ">$in_file.out", -format => 'fasta');


while (my $seq = $in->next_seq() ) {

    $seq->display_id( $seq->display_id() . "_$start_pos-$end_pos" );
    $out->write_seq( $seq->trunc($start_pos, $end_pos) );
}

结果:

>Contig[0001]_200-300
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAAT
TGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT

答案 3 :(得分:1)

我赞赏那些使用Bio::模块的人,因为我更喜欢写新的东西。不过,也许以下内容会有所帮助:

use strict;
use warnings;

my $end   = pop;
my $start = pop;
local $/ = '>';

while (<>) {
    chomp;
    next unless /(Contig.+)/;
    my ($header) = "$/$1_$start-$end\n";
    my $seq = ${^POSTMATCH};
    $seq =~ s/\s//g;
    print $header;
    print +( substr $seq, $start - 1, $end - $start ) . "\n";
}

用法:perl script.pl inFile start end [>outFile]

最后一个可选参数将输出定向到文件。

示例:perl script.pl data.fasta 200 300

数据集输出:

>Contig[0001]_200-300
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTT

开始和结束参数为pop ped off @ARGV,然后记录分隔符设置为“&gt;”。当文件被读取时 - 一次是一个fasta记录 - 捕获标题信息,将序列保留在${^POSTMATCH}中。从序列中删除所有空白区域。最后,重新格式化的标题为print ed,序列中的字符范围也是如此。

希望这有帮助!

答案 4 :(得分:1)

完成工作轻量级脚本:

#!/usr/bin/env perl

use strict;
use warnings;

my $first=1;
if ( $ARGV[0] eq '-0' )
{
    shift @ARGV;
    $first=0;
}

die("Usage: cat <coords> | get_subseqs.pl (-0) <fasta files> > out; where coords is id, start, end. Use -0 if coords start with 0 instead of 1.\n") unless @ARGV;

# read coords
my %contigs = (); # id => [ start, end ]
while (<STDIN>) {
    chomp;
    my @row = split(/\t/);
    die("Require tab-delim: id, start, end\n") unless @row == 3;
    $contigs{$row[0]} = [ $row[1], $row[2] ];
}

# get subseqs
my ($id,$seq,$start,$end);
foreach my $infile (@ARGV) {
    open(IN, '<', $infile) or die($!);
    while (<IN>) {
        if (/>(\S+)/) {
            my $id2 = $1;
            print ">${id}_$start-$end\n", reformat(substr($seq, $start-$first, $end-$start+1)) if $id;
            if ( exists($contigs{$id2}) ) {
                ($id,$seq,$start,$end) = ($id2,'',@{$contigs{$id2}});
                delete($contigs{$id2});
            } else { $id=undef }
        } elsif($id) { $seq .= $_ }
    }
    close(IN);
    print ">${id}_$start-$end\n", reformat(substr($seq, $start-$first, $end-$start+1)) if $id;
    $id = undef;
}

sub reformat { # add newline every 60 bases
    my $seq = shift;
    my $seq2 = '';
    while ( length($seq) > 60 ) {
        $seq2 .= substr($seq,0,60)."\n";
        $seq = substr($seq,60);
    }
    $seq2 .= $seq."\n";
    return $seq2;
}