Converting DNA sequences to amino acids

Posted: 2015-10-08 07:15:33

Tags: awk sed tr

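I'm trying to write a bash script that reads DNA sequences from a file. Each line in the file is one sequence, and sequences are separated by an empty line. I then need to find the amino acid that each codon (each group of three letters) of these sequences codes for. For example, if I have a file with the sequence: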

GCATGCTGCGATAACTTTGGCTGAACTTTGGCTGAAGCATGCTGCGAAACTTTGGCTGAACTTTGGCTG

then, starting from GCA (the first three letters), I want to decode the DNA into amino acids according to the following table:

Codon(s)                  Amino-acid
TTT,TTC                   Phe
TTA,TTG,CTT,CTC,CTA,CTG   Leu
ATT,ATC,ATA               Ile
ATG                       Met
GTT,GTC,GTA,GTG           Val
TCT,TCC,TCA,TCG           Ser
CCT,CCC,CCA,CCG           Pro
ACT,ACC,ACA,ACG           Thr
GCT,GCC,GCA,GCG           Ala
TAT,TAC                   Tyr
TAA,TAG                   Stop
CAT,CAC                   His
CAA,CAG                   Gln
AAT,AAC                   Asn
AAA,AAG                   Lys
GAT,GAC                   Asp
GAA,GAG                   Glu
TGT,TGC                   Cys
TGA                       Stop
TGG                       Trp
CGT,CGC,CGA,CGG           Arg
AGT,AGC                   Ser
AGA,AGG                   Arg
GGT,GGC,GGA,GGG           Gly

That is, I need to get:

AlaCysCysAspAsnPheGlyStopThrLeuAlaGluAlaCysCysGluThrLeuAlaGluLeuTrpLeu
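
As a quick way to check the codon grouping described above, a sequence can be cut into groups of three letters with the standard `fold` utility. This is only a minimal sketch; the function name `split_codons` is a hypothetical helper, not something from the original post, and it assumes one sequence per input line with no trailing whitespace.

```shell
#!/usr/bin/env bash
# Sketch: split each input line into codons (groups of three letters).
# split_codons is a hypothetical helper name used for illustration only.
split_codons() {
    # fold -w 3 breaks every input line after each third character.
    fold -w 3
}

# Usage (prints one codon per line):
#   printf '%s\n' GCATGCTGC | split_codons
```

Each output line is then one codon that can be looked up in the table.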

Then I need to print the name of each amino acid and the number of times it is used. For example:

Ala: 4
Cys: 4

and so on. I have 100 files with DNA sequences, but I'm not good at bash. I tried awk and tr, but I don't know how to encode the table in a bash script.

1 answer:

Answer 0 (score: 0)

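Well, this was a fun exercise. The script below is Perl rather than bash. It reads sequences from the files named on the command line (or from standard input) and prints each translated sequence followed by its amino-acid counts: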

#!/usr/bin/perl
use strict;
use warnings;

my %acid_of;
{
    my $raw = <<'***';
TTT,TTC                   Phe
TTA,TTG,CTT,CTC,CTA,CTG   Leu
ATT,ATC,ATA               Ile
ATG                       Met
GTT,GTC,GTA,GTG           Val
TCT,TCC,TCA,TCG           Ser
CCT,CCC,CCA,CCG           Pro
ACT,ACC,ACA,ACG           Thr
GCT,GCC,GCA,GCG           Ala
TAT,TAC                   Tyr
TAA,TAG                   Stop
CAT,CAC                   His
CAA,CAG                   Gln
AAT,AAC                   Asn
AAA,AAG                   Lys
GAT,GAC                   Asp
GAA,GAG                   Glu
TGT,TGC                   Cys
TGA                       Stop
TGG                       Trp
CGT,CGC,CGA,CGG           Arg
AGT,AGC                   Ser
AGA,AGG                   Arg
GGT,GGC,GGA,GGG           Gly
***

    # Expand each "codons  acid" row into one hash entry per codon.
    for my $line (split /\n/, $raw) {
        my ($codons, $acid) = split ' ', $line;
        for my $codon (split /,/, $codons) {
            $acid_of{$codon} = $acid;
        }
    }
}

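# Read sequences from the files named on the command line (or from STDIN).
while (my $line = readline) {
    # Blank separator lines are passed through unchanged by the continue block.
    next if $line !~ /\S/;

    my %count;
    # Replace each consecutive codon, starting at the beginning of the line,
    # with its amino acid and tally it; \G stops at the first non-codon.
    $line =~ s{\G([ACGT]{3})}{
        my $acid = $acid_of{$1};
        $count{$acid}++;
        $acid
    }eg;

    # Append the per-sequence counts, sorted by amino-acid name.
    for my $acid (sort keys %count) {
        $line .= "$acid: $count{$acid}\n";
    }
} continue {
    print $line;
}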
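
Since the question asked for bash and mentioned awk, here is a rough sketch of the same idea using awk inside a bash function. It is not the answer's method, just one possible equivalent. The function name `translate_dna` and the `???` placeholder for unrecognised codons are assumptions made for illustration. The sketch also assumes uppercase A/C/G/T input with Unix line endings, silently ignores an incomplete trailing codon, and skips blank separator lines instead of echoing them.

```shell
#!/usr/bin/env bash
# Sketch: translate DNA sequences (one per line) into amino acids and
# print per-sequence usage counts. translate_dna is a hypothetical name.
translate_dna() {
    awk '
    BEGIN {
        # Codon table from the question, stored as "codons acid" rows.
        t =   "TTT,TTC Phe;TTA,TTG,CTT,CTC,CTA,CTG Leu;ATT,ATC,ATA Ile;ATG Met;"
        t = t "GTT,GTC,GTA,GTG Val;TCT,TCC,TCA,TCG Ser;CCT,CCC,CCA,CCG Pro;"
        t = t "ACT,ACC,ACA,ACG Thr;GCT,GCC,GCA,GCG Ala;TAT,TAC Tyr;TAA,TAG Stop;"
        t = t "CAT,CAC His;CAA,CAG Gln;AAT,AAC Asn;AAA,AAG Lys;GAT,GAC Asp;"
        t = t "GAA,GAG Glu;TGT,TGC Cys;TGA Stop;TGG Trp;CGT,CGC,CGA,CGG Arg;"
        t = t "AGT,AGC Ser;AGA,AGG Arg;GGT,GGC,GGA,GGG Gly"
        n = split(t, rows, ";")
        for (i = 1; i <= n; i++) {
            split(rows[i], f, " ")            # f[1] = codon list, f[2] = acid
            m = split(f[1], codons, ",")
            for (j = 1; j <= m; j++) acid_of[codons[j]] = f[2]
        }
    }
    NF == 0 { next }                          # skip blank separator lines
    {
        out = ""
        split("", count)                      # reset counts for this sequence
        for (p = 1; p + 2 <= length($0); p += 3) {
            codon = substr($0, p, 3)
            # "???" is an assumed placeholder for codons not in the table.
            acid = (codon in acid_of) ? acid_of[codon] : "???"
            out = out acid
            count[acid]++
        }
        print out

        # Sort the amino-acid names with a small insertion sort, so the
        # output order does not depend on awk array traversal order.
        k = 0
        for (acid in count) names[++k] = acid
        for (a = 2; a <= k; a++) {
            v = names[a]
            for (b = a - 1; b >= 1 && names[b] > v; b--) names[b + 1] = names[b]
            names[b + 1] = v
        }
        for (a = 1; a <= k; a++) print names[a] ": " count[names[a]]
    }
    ' "$@"
}

# Usage, assuming the sequence files match *.txt (a hypothetical pattern):
#   for f in *.txt; do translate_dna "$f" > "$f.acids"; done
```

For the example sequence from the question, the first output line is the amino-acid string shown there, followed by lines such as `Ala: 4` and `Cys: 4`. Sorting inside awk rather than piping to `sort` keeps the translated line and its counts in a predictable order even when output is buffered.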