Get sequences of a FASTA file from a hash

Time: 2016-05-29 18:21:17

Tags: perl

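I am trying to read a FASTA file into a hash so that I can manipulate it later, with the ID as the key and the sequence as the value. But my output prints only the last ID, with all of the sequences concatenated together.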

Input

>cel-mir-35 MI0000006 Caenorhabditis elegans miR-35 stem-loop
UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUA
UCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCC

>cel-mir-36 MI0000007 Caenorhabditis elegans miR-36 stem-loop
CACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUA
UCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGA

>cel-mir-37 MI0000008 Caenorhabditis elegans miR-37 stem-loop
UUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUA
UCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCU

>cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop
GUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCA
CCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU

My output is shown below. What I want instead is each ID with its corresponding sequence.

cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop     UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU
cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop     UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU
cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop     UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU
cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop     UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU

Which part should I change?

Also, how can I use the sequence as the key and the ID as the value?

3 answers:

Answer 0: (score: 1)

You are not accumulating the hash correctly, and you are not printing it either.

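    while (<FILE>) {
        chomp;

        if($_ =~ /^>(.+)/){
            $id  = $1;
            $seq = '';                # Start a fresh sequence for each new ID.

        } elsif (/^[A-Z]+$/) {
            $seq .= $_;

        } elsif (defined $id) {
            $fastahash{$id} = $seq;   # Populate the hash when a blank line ends the record.
        }
    }

    # Store the last record too, in case the file does not end with a blank line.
    $fastahash{$id} = $seq if defined $id;

    for my $id (keys %fastahash) {
        print "$id $fastahash{$id}\n";  # Print it.
    }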

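Answer 1: (score: 0)

I think you are assigning $_ to fastahash where you should be assigning $seq. Also, you never reset $id or $seq, so there is a latent bug. Try something like this: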

while (<FILE>) {
    chomp;

    if (/^>(.+)/) {
        $id = $1;
    } elsif (/^[A-Z]+$/) {
        $seq .= $_;
    } else {
        $fastahash{$id} = $seq if $id;
        $id = undef;
        $seq = '';
    }
}

$fastahash{$id} = $seq if $id;
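
The loop above relies on a blank line to end each record. As a hedged variation, here is a minimal sketch that appends each sequence line straight into the hash entry for the current ID, so it should also cope with files that have no blank lines between records. The sub name read_fasta and the two sample records are made up for illustration, and the data is read from an in-memory filehandle rather than a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash reference mapping ID => sequence.
sub read_fasta {
    my ($fh) = @_;
    my ( %fasta, $id );

    while ( my $line = <$fh> ) {
        chomp $line;

        if ( $line =~ /^>(.+)/ ) {
            $id = $1;
            $fasta{$id} = '';        # each header starts a fresh, empty record
        }
        elsif ( defined $id and $line =~ /^[A-Z]+$/ ) {
            $fasta{$id} .= $line;    # append to the current record only
        }
    }

    return \%fasta;
}

# Two made-up records, read from an in-memory filehandle.
my $text = ">id-1 first\nACGU\nGGCC\n\n>id-2 second\nUUAA\n";
open my $fh, '<', \$text or die $!;
my $fasta = read_fasta($fh);

print "$_\t$fasta->{$_}\n" for sort keys %$fasta;
```

Because nothing is carried over in a separate $seq variable, there is nothing to reset and no final record to flush after the loop.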

Answer 2: (score: 0)

I realise this is not Code Review, but I think a few comments on your code would be useful

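  • There is usually no need to initialise variables at the point where you declare them. In fact, setting a scalar variable to the empty string will often suppress useful warning messages

  • Best practice is to use lexical file handles and the three-argument form of open. So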

    open FILE, "file.fasta", or die $!;
    

    is better written as

    open my $fh, '<', 'file.fasta' or die $!;
    

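    (Note that your original code also has a superfluous comma.)

    Lexical file handles usually make an explicit close unnecessary, because they are closed implicitly when they are destroyed on going out of scope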

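  • You may not be familiar with Perl's default variable $_, but code can be clearer and more concise if you use it

    You already use it with chomp, which is equivalent to chomp $_, and $_ =~ /^>(.+)/ can be written as just /^>(.+)/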

  • Note that foreach is exactly equivalent to for, and many programmers familiar with Perl prefer the shorter for
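
To make the first two points concrete, here is a small sketch, not taken from the original program. It captures warnings in an array (an arrangement chosen only for this demonstration) to show that an empty-string initialiser hides the "uninitialized value" warning, and it reads a made-up record through a lexical filehandle that is closed when its enclosing block ends.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so the difference is visible.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, shift };

my $blank = '';    # initialised "just in case"
my $unset;         # declared only

my $first  = "ID: $blank";    # silent, although no real value was ever assigned
my $second = "ID: $unset";    # warns: Use of uninitialized value

print scalar(@warnings), " warning(s) captured\n";

# A lexical filehandle is closed implicitly when it goes out of scope.
my $data = ">demo-id\nACGU\n";    # made-up record, read from memory
my $first_line;
{
    open my $fh, '<', \$data or die $!;
    $first_line = <$fh>;
}    # $fh is destroyed here, which closes the handle
chomp $first_line;
print "$first_line\n";
```

Only the uninitialised variable produces a warning, which is exactly the hint that would be lost if every scalar were pre-set to an empty string.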

I would write your program like this

use strict;
use warnings;

open my $fh, '<', 'file.fasta' or die $!;

my %fasta_hash;
my ($id, $seq);

while ( <$fh> ) {

    chomp;

    if ( /^>(.+)/ ) {
        $id  = $1;
        $seq = '';    # start a fresh sequence for each new record
    }
    elsif ( /\S/ and not /[^ACGTU]/ ) {
        $seq .= $_;
    }
    elsif ( defined $id ) {
        $fasta_hash{$id} = $seq;    # a blank line ends the current record
    }
}

# The file may not end with a blank line, so store the final record too
$fasta_hash{$id} = $seq if defined $id;

for my $id ( sort keys %fasta_hash ) {
    print "$id -- $fasta_hash{$id}\n";
}

Output

cel-mir-35 MI0000006 Caenorhabditis elegans miR-35 stem-loop -- UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCC
cel-mir-36 MI0000007 Caenorhabditis elegans miR-36 stem-loop -- CACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGA
cel-mir-37 MI0000008 Caenorhabditis elegans miR-37 stem-loop -- UUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCU
cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop -- GUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU

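As for inverting the hash so that the sequence is used as the key: in my version above, you only need to change the assignment $fasta_hash{$id} = $seq; to $fasta_hash{$seq} = $id; wherever it appears.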
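
One caveat with a sequence-keyed hash is that two IDs sharing the same sequence collapse into a single entry. The sketch below, using made-up IDs and sequences, contrasts a plain inversion with a version that keeps every ID by storing a list of IDs per sequence. The hash names %id_for and %ids_for are illustrative only.

```perl
use strict;
use warnings;

# Made-up ID => sequence data in which two IDs share one sequence.
my %fasta_hash = (
    'id-1' => 'ACGU',
    'id-2' => 'GGCC',
    'id-3' => 'ACGU',
);

# Plain inversion: with duplicate sequences, only one of the IDs survives.
my %id_for = reverse %fasta_hash;

# Inversion that keeps every ID: sequence => [ list of IDs ].
my %ids_for;
for my $id ( sort keys %fasta_hash ) {
    push @{ $ids_for{ $fasta_hash{$id} } }, $id;
}

for my $seq ( sort keys %ids_for ) {
    print "$seq\t@{ $ids_for{$seq} }\n";
}
```

If the sequences are known to be unique, the plain inversion is enough; otherwise the list-valued hash avoids silently dropping IDs.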