Perl程序寻找具有特定序列的k聚体

时间:2016-11-04 01:39:30

标签: perl

我正在尝试增强我之前编写的perl程序,以便识别以GG结尾的前1000长度23 k-mers并打印出仅出现一次的k-mers序列。但是,无论我在哪里添加reg exp,我都无法获得预期的结果。

我的代码:

#!/usr/bin/perl
use strict;
use warnings;

my $k           = 23;
my $input       = 'Fasta.fasta';
my $output      = 'Fasta2.fasta';
my $match_count = 0;

#Open File
unless ( open( FASTA, "<", $input ) ) {
    die "Unable to open fasta file", $!;
}

#Unwraps the FASTA format file
$/ = ">";

#Separate header and sequence
#Remove spaces
unless ( open( OUTPUT, ">", $output ) ) {
    die "Unable to open file", $!;
}

<FASTA>;    # discard 'first' 'empty' record

my %seen;
while ( my $line = <FASTA> ) {
    chomp $line;
    my ( $header, @seq ) = split( /\n/, $line );
    my $sequence = join '', @seq;

    for ( length($sequence) >= $k ) {
        $sequence =~ m/([ACTG]{21}[G]{2})/g;

        for my $i ( 0 .. length($sequence) - $k ) {
            my $kmer = substr( $sequence, $i, $k );

            ##while ($kmer =~ m/([ACTG]{21}[G]{2})/g){
            $match_count = $match_count + 1;
            print OUTPUT ">crispr_$match_count", "\n", "$kmer", "\n" unless $seen{$kmer}++;
        }
    }
}

输入的fasta文件如下所示:

> >2L type=chromosome_arm; loc=2L:1..23011544; ID=2L;  dbxref=REFSEQ:NT_033779,GB:AE014134; MD5=bfdfb99d39fa5174dae1e2ecd8a231cd; length=23011544; release=r5.54; species=Dmel;
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT
TTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGT
GCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGA
TGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCA
TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTG
CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG
CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATT
TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATAT
TGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAA
ACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTG
CCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCG
AGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATG
GTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAA
TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA
GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTA
TATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCG
GAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGC
CAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTT
TACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATAT

依旧......

预期的结果(打印出23k-mers的GG结尾只显示序列中的一次)我希望得到:

>crispr_1
GGGTGGAGCTCCCGAAATGCAGG
>crispr_2
TTAATAAATATTGACACAGCGGG
>crispr_3
ATCGTGGGGCGTTTTGTGAAAGG
>crispr_4
AGTTTTTCACATAATCAGACAGG
>crispr_5
GTGTTGGATGAGTGTCCTCTGGG
>crispr_6
ATAGGTTGGTTGTTTTAAAAGGG
>crispr_7
AAATTTTTGTTGCCACTGAATGG
>crispr_8
AAGTTTCGAACTACGATGGTTGG
>crispr_9
CATGCTTTGTGGAAATAAGTCGG
>crispr_10
CACAGTGGGTGTTTGCACCTCGG
.... and so on

我当前的代码创建了一个fasta文件,其中包含以下内容:

>crispr_1
CGACAATGCACGACAGAGGAAGC
>crispr_2
GACAATGCACGACAGAGGAAGCA
>crispr_3
ACAATGCACGACAGAGGAAGCAG
>crispr_4
CAATGCACGACAGAGGAAGCAGA
>crispr_5
AATGCACGACAGAGGAAGCAGAA
>crispr_6
ATGCACGACAGAGGAAGCAGAAC
>crispr_7
TGCACGACAGAGGAAGCAGAACA
>crispr_8
GCACGACAGAGGAAGCAGAACAG
>crispr_9
CACGACAGAGGAAGCAGAACAGA
>crispr_10
ACGACAGAGGAAGCAGAACAGAT
.... and so on

如果我删除了

for (length($sequence) >=$k){
$sequence =~m/([ACTG]{21}[G]{2})/g;

并添加## while($ kmer = ~m /([ACTG] {21} [G] {2})/ g){

 while ($kmer =~ m/([ACTG]{21}[G]{2})/g){

我正在获取fasta文件(结果编号不正确且无法区分重复序列和唯一序列):

>crispr_1
CATTTTCTCTCCCATATTATAGG
>crispr_2
ATTTTCTCTCCCATATTATAGGG
>crispr_3
TATTGTGCTCTTTGATTTTTTGG
>crispr_4
GATTTTTTGGCAACCCAAAATGG
>crispr_5
TTTTTGGCAACCCAAAATGGTGG
>crispr_6
TTGGCAACCCAAAATGGTGGCGG
>crispr_7
ACGACAGAGAGAGAGAGCAGCGG
>crispr_8
AAATCGACAATGCACGACAGAGG
>crispr_91
TATTGTGATCTTCGATTTTTTGG
>crispr_93
TTTTTGGCAACCCAAAATGGAGG
.... and so on

我试图围绕我的代码移动正则表达式,但它们都没有产生预期的结果。我不知道我在这里做错了什么。当计数达到1000时,我还没有将程序的退出添加到代码中。

提前致谢!

2 个答案:

答案 0 :(得分:2)

我不确定我完全理解你的问题,但以下内容可能是你需要的。

<FASTA>; # discard 'first' 'empty' record

my %data;
while (my $line = <FASTA>){
    chomp $line;
    my($header, @seq) = split(/\n/, $line);
    my $sequence = join '', @seq;

    for my $i (0 .. length($sequence) - $k) {
        my $kmer = substr($sequence, $i, $k);

        $data{$kmer}++ if $kmer =~ /GG$/;
    }
}
my $i = 0;
for my $kmer (sort {$data{$b} <=> $data{$a}} keys %data) {
    printf "crispr_%d\n%s appears %d times\n", ++$i, $kmer, $data{$kmer};
    last if $i == 1000; 
}

我拥有的文件的一些输出是:

crispr_1
ggttttccggcacccgggcctgg appears 4 times
crispr_2
ccgagctgggcgagaagtagggg appears 4 times
crispr_3
gccgagctgggcgagaagtaggg appears 4 times
crispr_4
gcacccgggcctgggtggcaggg appears 4 times
crispr_5
agcagcgggatcgggttttccgg appears 4 times
crispr_6
gctgggcgagaagtaggggaggg appears 4 times
crispr_7
cccttctgcttcagtgtgaaagg appears 4 times
crispr_8
gtggcagggaagaatgtgccggg appears 4 times
crispr_9
gatcgggttttccggcacccggg appears 4 times
crispr_10
tgagggaaagtgctgctgctggg appears 4 times
crispr_11
agctgggcgagaagtaggggagg appears 4 times

. . . .

ggcacccgggcctgggtggcagg appears 4 times
crispr_50
gaatctctttactgcctggctgg appears 4 times
crispr_51
accacaacattgacagttggtgg appears 2 times
crispr_52
caacattgacagttggtggaggg appears 2 times
crispr_53
catgctcatcgtatctgtgttgg appears 2 times
crispr_54
gattaatgaagtggttattttgg appears 2 times
crispr_55
gaaaccacaacattgacagttgg appears 2 times
crispr_56
aacattgacagttggtggagggg appears 2 times
crispr_57
gacttgatcgattaatgaagtgg appears 2 times
crispr_58
acaacattgacagttggtggagg appears 2 times
crispr_59
gaaccatatattgttatcactgg appears 2 times
crispr_60
ccacagcgcccacttcaaggtgg appears 1 times
crispr_61
ctgctcctgggtgtgagcagagg appears 1 times
crispr_62
ccatatattatctgtggtttcgg appears 1 times

. . . .

<强>更新 要获得您在评论中提到的结果(如下所示),请将输出代码替换为:

my $i = 1;

while (my ($kmer, $count) = each %data) {
    next unless $count == 1;
    print "crispr_$i\n$kmer\n";
    last if $i++ == 1000;
}

回答我自己的评论首先 1000。

<FASTA>; # discard 'first' 'empty' record

my %seen;
my @kmers;
while (my $line = <FASTA>){
    chomp $line;
    my($header, @seq) = split(/\n/, $line);
    my $sequence = join '', @seq;

    for my $i (0 .. length($sequence) - $k) {
        my $kmer = substr($sequence, $i, $k);

        if ($kmer =~ /GG$/) {
            push @kmers, $kmer unless $seen{$kmer}++;
        }
    }
}

my $i = 1;
for my $kmer (@kmers) {
    next unless $seen{$kmer} == 1;
    print "crispr_$i\n$kmer\n";
    last if $i++ == 1000;
}

答案要检查以GG结尾的最后12个字符的唯一性,下面的代码就可以实现。

        if ($kmer =~ /(.{10}GG)$/) {
            my $substr = $1;
            push @kmers, $kmer unless $seen{$substr}++;
        }

my $i = 1;
for my $kmer (@kmers) {
    my $substr = substr $kmer, -12;
    next unless $seen{$substr} == 1;
    print "crispr_$i\n$kmer\n";
    last if $i++ == 1000;
}

答案 1 :(得分:0)

实际上,这段代码

$sequence =~m/([ACTG]{21}[G]{2})/g;

此行仅用于正则表达式匹配,如果您尝试打印此$sequence,它肯定会打印出原始结果。

请添加像这样的代码段

if($sequence =~/([ACTG]{21}[G]{2}$)/g) 
{


}#please remember to match the end with $.
BTW,看起来解析这个数据的多个for循环不是很合理,解析速度没有最佳效率。