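I am trying to write a bash script that reads DNA sequences from a file. Each line of the file is one sequence, and the sequences are separated by blank lines. I then need to find the amino acid encoded by each codon (each group of three letters) in those sequences. For example, if I have a file containing the sequence: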
GCATGCTGCGATAACTTTGGCTGAACTTTGGCTGAAGCATGCTGCGAAACTTTGGCTGAACTTTGGCTG
then, starting from GCA (the first three letters), I want to decode the DNA into amino acids according to the following table:
Codon(s) Amino-acid
TTT,TTC Phe
TTA,TTG,CTT,CTC,CTA,CTG Leu
ATT,ATC,ATA Ile
ATG Met
GTT,GTC,GTA,GTG Val
TCT,TCC,TCA,TCG Ser
CCT,CCC,CCA,CCG Pro
ACT,ACC,ACA,ACG Thr
GCT,GCC,GCA,GCG Ala
TAT,TAC Tyr
TAA,TAG Stop
CAT,CAC His
CAA,CAG Gln
AAT,AAC Asn
AAA,AAG Lys
GAT,GAC Asp
GAA,GAG Glu
TGT,TGC Cys
TGA Stop
TGG Trp
CGT,CGC,CGA,CGG Arg
AGT,AGC Ser
AGA,AGG Arg
GGT,GGC,GGA,GGG Gly
That is, I need to get:
AlaCysCysAspAsnPheGlyStopThrLeuAlaGluAlaCysCysGluThrLeuAlaGluLeuTrpLeu
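One way the table could be encoded in a shell script is to hand it to awk as a lookup array. This is only a sketch: `translate_dna` is a made-up name, and uppercasing the input, dropping a trailing incomplete codon, and printing `Xaa` for a codon outside the table are assumptions, not requirements stated in the post.

```shell
#!/bin/sh
# Sketch: translate DNA sequences (one per line on stdin) into amino-acid names.
# translate_dna is a hypothetical helper name.
translate_dna() {
    awk '
    BEGIN {
        # Codon table from the question, written as "codon,codon,...=Name".
        t =   "TTT,TTC=Phe TTA,TTG,CTT,CTC,CTA,CTG=Leu ATT,ATC,ATA=Ile ATG=Met"
        t = t " GTT,GTC,GTA,GTG=Val TCT,TCC,TCA,TCG=Ser CCT,CCC,CCA,CCG=Pro"
        t = t " ACT,ACC,ACA,ACG=Thr GCT,GCC,GCA,GCG=Ala TAT,TAC=Tyr TAA,TAG,TGA=Stop"
        t = t " CAT,CAC=His CAA,CAG=Gln AAT,AAC=Asn AAA,AAG=Lys GAT,GAC=Asp GAA,GAG=Glu"
        t = t " TGT,TGC=Cys TGG=Trp CGT,CGC,CGA,CGG,AGA,AGG=Arg AGT,AGC=Ser"
        t = t " GGT,GGC,GGA,GGG=Gly"
        n = split(t, entry, " ")
        for (i = 1; i <= n; i++) {
            split(entry[i], kv, "=")
            m = split(kv[1], codon, ",")
            for (j = 1; j <= m; j++) acid[codon[j]] = kv[2]
        }
    }
    NF {                                  # skip the blank separator lines
        seq = toupper($1)                 # assumption: accept lowercase input too
        out = ""
        # Walk the sequence three letters at a time; a trailing partial codon is dropped.
        for (i = 1; i + 2 <= length(seq); i += 3) {
            c = substr(seq, i, 3)
            if (c in acid) out = out acid[c]
            else           out = out "Xaa"   # placeholder for codons not in the table
        }
        print out
    }'
}
```

With a hypothetical input file this would be called as `translate_dna < sequences.txt`, printing one decoded line per sequence.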
Then I need to print the name of each amino acid and the number of times it is used. For example:
Ala: 4
Cys: 4
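and so on. I have 100 files containing DNA sequences, but I am not very good at bash. I tried awk and tr, but I don't know how to encode the table in a bash script.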
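The counting step can be sketched separately from the decoding. The helper below (`count_acids`, a made-up name) reads already-decoded text such as `AlaCysCys...` on standard input and tallies the names. It assumes every name is one capital letter followed by lowercase letters, which holds for the table above, and it totals over all input lines rather than per sequence.

```shell
#!/bin/sh
# Sketch: count amino-acid names in decoded lines read from stdin.
# count_acids is a hypothetical helper name.
count_acids() {
    LC_ALL=C awk '
    {
        s = $0
        # Peel off one name at a time: a capital letter plus lowercase letters.
        while (match(s, /[A-Z][a-z]*/)) {
            n[substr(s, RSTART, RLENGTH)]++
            s = substr(s, RSTART + RLENGTH)
        }
    }
    END {
        # awk iterates arrays in no fixed order, so sort the result afterwards.
        for (name in n) printf "%s: %d\n", name, n[name]
    }' | LC_ALL=C sort
}
```

Fed the decoded example line from above, this prints `Ala: 4` and `Cys: 4` along with one line for each of the other names.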
Answer 0 (score: 0)
Well, this was a fun exercise:
#!/usr/bin/perl
use strict;
use warnings;
# Map each codon to its amino-acid name, built from the table in the question.
my %acid_of;
{
    my $raw = <<'***';
TTT,TTC Phe
TTA,TTG,CTT,CTC,CTA,CTG Leu
ATT,ATC,ATA Ile
ATG Met
GTT,GTC,GTA,GTG Val
TCT,TCC,TCA,TCG Ser
CCT,CCC,CCA,CCG Pro
ACT,ACC,ACA,ACG Thr
GCT,GCC,GCA,GCG Ala
TAT,TAC Tyr
TAA,TAG Stop
CAT,CAC His
CAA,CAG Gln
AAT,AAC Asn
AAA,AAG Lys
GAT,GAC Asp
GAA,GAG Glu
TGT,TGC Cys
TGA Stop
TGG Trp
CGT,CGC,CGA,CGG Arg
AGT,AGC Ser
AGA,AGG Arg
GGT,GGC,GGA,GGG Gly
***
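    for my $line (split /\n/, $raw) {
        my ($codons, $acid) = split ' ', $line;
        for my $codon (split /,/, $codons) {
            $acid_of{$codon} = $acid;
        }
    }
}
# Read sequences from the files named on the command line (or from standard input).
while (my $line = readline) {
    next if $line !~ /\S/;    # blank separator lines are printed unchanged
    my %count;
    # Replace each consecutive codon with its amino-acid name, counting as we go.
    $line =~ s{\G([ACGT]{3})}{
        my $acid = $acid_of{$1};
        $count{$acid}++;
        $acid
    }eg;
    # Append the counts for this sequence, sorted by name.
    for my $acid (sort keys %count) {
        $line .= "$acid: $count{$acid}\n";
    }
} continue {
    print $line;
}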
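Since the script reads whatever is given on the command line or on standard input, handling the 100 files could be a small shell loop around it. A rough sketch follows; `run_on_files`, `translate.pl`, the `seqs/` directory, and the `.out` suffix are all hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch: run a filter command over each input file and write the result
# next to it as <file>.out. run_on_files is a hypothetical helper name.
run_on_files() {
    filter=$1
    shift
    for f in "$@"; do
        [ -f "$f" ] || continue          # skip unmatched globs and directories
        # $filter is left unquoted on purpose so "perl translate.pl" splits into words.
        $filter < "$f" > "$f.out"
    done
}

# Example with assumed names:
#   run_on_files "perl translate.pl" seqs/*.txt
```

Writing to a separate `.out` file keeps the original sequence files untouched, so the loop can be rerun safely.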