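Translating cDNA to amino acids with Perl

Time: 2014-02-04 02:49:00

Tags: arrays perl hashtable dna-sequence

I am trying to translate the complementary strand of DNA into its corresponding amino acids. So far, I have this code: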

#!/usr/bin/perl

open (INFILE, "sumaira2.out");
open (OUTFILE3, ">>sumaira3.out");

%aacode = (
  TTT => "F", TTC => "F", TTA => "L", TTG => "L",
  TCT => "S", TCC => "S", TCA => "S", TCG => "S",
  TAT => "Y", TAC => "Y", TAA => "STOP", TAG => "STOP",
  TGT => "C", TGC => "C", TGA => "STOP", TGG => "W",
  CTT => "L", CTC => "L", CTA => "L", CTG => "L",
  CCT => "P", CCC => "P", CCA => "P", CCG => "P",
  CAT => "H", CAC => "H", CAA => "Q", CAG => "Q",
  CGT => "R", CGC => "R", CGA => "R", CGG => "R",
  ATT => "I", ATC => "I", ATA => "I", ATG => "M",
  ACT => "T", ACC => "T", ACA => "T", ACG => "T",
  AAT => "N", AAC => "N", AAA => "K", AAG => "K",
  AGT => "S", AGC => "S", AGA => "R", AGG => "R",
  GTT => "V", GTC => "V", GTA => "V", GTG => "V",
  GCT => "A", GCC => "A", GCA => "A", GCG => "A",
  GAT => "D", GAC => "D", GAA => "E", GAG => "E",
  GGT => "G", GGC => "G", GGA => "G", GGG => "G",
); # this is the hash table for the amino acids

while ($line=<INFILE>){
  $codon = $codon.$line;
  @array = split "",$codon;
} # splits all the characters in the text

for ($count = 0; $count<scalar@array; $count= $count + 3) {
  $codon = $codon.$array[$count].$array[$count+1].$array[$count+2];
  $aminoacid = $aacode{$codon};
} # tells how to read the codon and execute the hash table

$protein = $protein.$aminoacid; #catenate the string

print OUTFILE3 $protein;

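My infile already contains the reverse-complemented DNA; I just want to translate it. For some reason, my output is empty. I don't know what is wrong, because Terminal doesn't give me any errors either. Any help would be greatly appreciated.

Here is a sample of the file I want to translate: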

TCGTCGCCTCCCCAACCTAGGTAGTCCGTTGCTGCCCGACGACGGCCGGTAGTCGCCT GCGTCCCTCCTGAAAGGCGTTGGCCGGCAAGCTACGCCGTGGCTACCGGAAGCGCGTCCCCATCAC GCGGTCCTAACTGAACGCGACGGGATGGAGAGTGATCACTCCCCGCCGTCGCGTAGTTCGCCACTC

It continues for another 17 lines.

5 answers:

Answer 0 (score: 1)

Perhaps the following will help:

use strict;
use warnings;

my %aacode = (
  TTT => "F", TTC => "F", TTA => "L", TTG => "L",
  TCT => "S", TCC => "S", TCA => "S", TCG => "S",
  TAT => "Y", TAC => "Y", TAA => "STOP", TAG => "STOP",
  TGT => "C", TGC => "C", TGA => "STOP", TGG => "W",
  CTT => "L", CTC => "L", CTA => "L", CTG => "L",
  CCT => "P", CCC => "P", CCA => "P", CCG => "P",
  CAT => "H", CAC => "H", CAA => "Q", CAG => "Q",
  CGT => "R", CGC => "R", CGA => "R", CGG => "R",
  ATT => "I", ATC => "I", ATA => "I", ATG => "M",
  ACT => "T", ACC => "T", ACA => "T", ACG => "T",
  AAT => "N", AAC => "N", AAA => "K", AAG => "K",
  AGT => "S", AGC => "S", AGA => "R", AGG => "R",
  GTT => "V", GTC => "V", GTA => "V", GTG => "V",
  GCT => "A", GCC => "A", GCA => "A", GCG => "A",
  GAT => "D", GAC => "D", GAA => "E", GAG => "E",
  GGT => "G", GGC => "G", GGA => "G", GGG => "G",
); # this is the hash table for the amino acids

my $compDNA = uc do { local $/; <> };
$compDNA =~ s/\s+//g;

my @codons = unpack '(A3)*', $compDNA;
my @aminoAcids = map { exists $aacode{$_} ? $aacode{$_} : "?$_?" } @codons;
print join '', @aminoAcids;

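Usage: perl script.pl compDNA_File [>aminoAcid_File]

The last, optional parameter directs the output to a file.

First, the entire file is slurped into a variable (and converted to all uppercase). Next, all whitespace is removed. unpack is used to create a list of three-character elements (codons). map is used to translate the codons into amino acids using the hash you provided. (Note that if there is no key for a codon, the codon itself is inserted, surrounded by question marks.) Finally, those amino acids are joined into a single string, and the result is printed.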
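The codon-splitting step described above can be tried on its own. This is a minimal sketch, not the answer's script: split_codons is a hypothetical helper name, and the input string is made up. It shows that unpack with '(A3)*' returns any trailing one or two bases as a short final element, which the "?codon?" fallback would then flag.

```perl
use strict;
use warnings;

# Hypothetical helper: clean a sequence and cut it into three-character codons.
sub split_codons {
    my ($dna) = @_;
    $dna = uc $dna;      # normalize to uppercase so hash keys match
    $dna =~ s/\s+//g;    # drop spaces and line breaks
    return unpack '(A3)*', $dna;
}

# Eleven bases after cleaning, so the last element is an incomplete codon.
my @codons = split_codons("tcg tcg\nCCT cc");
print join(',', @codons), "\n";
```

If incomplete trailing codons should be dropped rather than flagged, one option is to filter the list with grep { length == 3 } before the lookup.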

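Answer 1 (score: 0)

Don't you want to put

print OUTFILE3 $protein;

inside your for loop, so that you print each protein as you process it, rather than only the last one left over after the for loop has finished, like this?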

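for ($count = 0; $count < scalar @array; $count = $count + 3) {
  # build the codon from these three bases only; appending to the old $codon
  # would keep the whole input in the key, and the lookup would never match
  $codon = $array[$count].$array[$count+1].$array[$count+2];
  $aminoacid = $aacode{$codon};

  print OUTFILE3 $aminoacid;

} # tells how to read the codon and execute the hash table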
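To make the point about loop placement concrete, here is a small sketch under stated assumptions: the three-entry %aacode is a toy subset of the full table, translate_accumulating is a hypothetical name, and unknown codons are shown as "?" purely for illustration. The concatenation happens inside the loop, so every amino acid is kept rather than only the last one.

```perl
use strict;
use warnings;

# Toy subset of the codon table, for illustration only.
my %aacode = (ATG => "M", TTT => "F", GGA => "G");

# Hypothetical helper: translate complete codons, appending inside the loop.
sub translate_accumulating {
    my ($dna) = @_;
    my $protein = '';
    for (my $i = 0; $i + 3 <= length $dna; $i += 3) {
        my $codon     = substr $dna, $i, 3;
        my $aminoacid = exists $aacode{$codon} ? $aacode{$codon} : '?';
        $protein .= $aminoacid;    # inside the loop: nothing is lost
    }
    return $protein;
}

print translate_accumulating("ATGTTTGGA"), "\n";
```

The loop condition $i + 3 <= length $dna also stops before an incomplete trailing codon, which avoids reading past the end of the sequence.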

Answer 2 (score: 0)

Try executing the script below as scriptname < sumaira2.out >> sumaira3.out. If it works as expected, set $DEBUG to zero to remove the debug output.

#!/usr/bin/perl
use strict;
use warnings;

my $DEBUG = 2;

my %aacode = (
  TTT => "F", TTC => "F", TTA => "L", TTG => "L",
  TCT => "S", TCC => "S", TCA => "S", TCG => "S",
  TAT => "Y", TAC => "Y", TAA => "STOP", TAG => "STOP",
  TGT => "C", TGC => "C", TGA => "STOP", TGG => "W",
  CTT => "L", CTC => "L", CTA => "L", CTG => "L",
  CCT => "P", CCC => "P", CCA => "P", CCG => "P",
  CAT => "H", CAC => "H", CAA => "Q", CAG => "Q",
  CGT => "R", CGC => "R", CGA => "R", CGG => "R",
  ATT => "I", ATC => "I", ATA => "I", ATG => "M",
  ACT => "T", ACC => "T", ACA => "T", ACG => "T",
  AAT => "N", AAC => "N", AAA => "K", AAG => "K",
  AGT => "S", AGC => "S", AGA => "R", AGG => "R",
  GTT => "V", GTC => "V", GTA => "V", GTG => "V",
  GCT => "A", GCC => "A", GCA => "A", GCG => "A",
  GAT => "D", GAC => "D", GAA => "E", GAG => "E",
  GGT => "G", GGC => "G", GGA => "G", GGG => "G",
); # this is the hash table for the amino acids

my ($codon, $protein) = ('','');

while (<STDIN>){
  chomp;    # remove end of line characters
  s/\s//g;  # remove whitespaces
  $codon .= $_;
}
print STDERR "DBG Codon: ", $codon, "\n" if $DEBUG >= 1;

my @aminoacids = ( $codon =~ /(...)/sg );
print STDERR "Aminoacids: ", join(" ", @aminoacids), "\n" if $DEBUG >= 2;

for my $aminoacid (@aminoacids) {
  die "Unknown aminoacid: $aminoacid\n" unless exists $aacode{$aminoacid};
  $protein .= $aacode{$aminoacid};
}
print STDERR "DBG Protein: ", $protein, "\n" if $DEBUG >= 1;

print $protein, "\n";
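One detail of the regex-based splitting is easy to miss: /(...)/g only returns complete triplets, so one or two leftover bases at the end of the input are silently ignored. The sketch below isolates that step; codons_by_regex is a hypothetical helper name and the input is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return complete codons plus the number of leftover bases.
sub codons_by_regex {
    my ($seq) = @_;
    $seq =~ s/\s//g;                    # remove all whitespace, including newlines
    my @codons   = $seq =~ /(...)/g;    # complete three-character groups only
    my $leftover = length($seq) % 3;    # bases that did not fit into a codon
    return (\@codons, $leftover);
}

my ($codons, $leftover) = codons_by_regex("TCGTCG CC\n");
print "codons: @$codons\n";
print "leftover bases: $leftover\n";
```

Reporting the leftover count on STDERR, in the same way as the $DEBUG messages, would be one way to notice a truncated or frame-shifted input.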

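Answer 3 (score: 0)

I strongly recommend using BioPerl, or some other library/toolkit, for tasks like this. The reason is that, besides there being 3 reading frames, there are 16 codon tables. In my opinion, people have already spent too much effort on this problem (and I don't see any correct solution here either), and doing anything beyond the trivial will take a lot more work and code. Here is a simple example of translating with the standard codon table.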

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::SeqIO;

my $usage = "$0 nt.fasta";
my $file  = shift or die $usage;
my $seqio = Bio::SeqIO->new(-file => $file); 

my $seqobj = $seqio->next_seq;   # create a Bio::Seq object
my $trans  = $seqobj->translate; # call the translate method 
                                 # on the Bio::Seq object

print $trans->seq;               # $trans is a Bio::Seq object, 
                                 # so we call the seq method to get the sequence

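You can modify this slightly to handle multiple sequences or to use a different codon table. You can also supply a custom codon table. The BioPerl HOWTO page on translating sequences has a good tutorial.

EDIT: The other two solutions I tried handle only a single sequence, and they do not parse FASTA format as I had assumed. One major practical consideration is that you should insert a symbol for stop codons in your translation (BioPerl's default is an asterisk, but you can change it to whatever you like) rather than the word "STOP", because no other tool will recognize it. It is also hard to pick out visually.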
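For readers without BioPerl installed, the two practical points above (reading frames and a single-character stop symbol) can be sketched in plain Perl. This is an illustration under stated assumptions, not BioPerl's implementation: translate_frame is a hypothetical name, the table is built from the standard genetic code written in T, C, A, G order, "*" marks stop codons, and "X" for unrecognized codons is a choice made here.

```perl
use strict;
use warnings;

# Standard genetic code, one letter per codon, with bases ordered T, C, A, G.
my @bases = qw(T C A G);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

my %standard;
my $index = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $standard{"$b1$b2$b3"} = substr $aas, $index++, 1;
        }
    }
}

# Hypothetical helper: translate one forward frame (0, 1, or 2).
sub translate_frame {
    my ($dna, $frame) = @_;
    $dna = uc $dna;
    $dna =~ s/\s+//g;
    my $protein = '';
    for (my $pos = $frame; $pos + 3 <= length $dna; $pos += 3) {
        my $codon = substr $dna, $pos, 3;
        # 'X' for codons outside the table (e.g. containing N) is an assumption
        $protein .= exists $standard{$codon} ? $standard{$codon} : 'X';
    }
    return $protein;
}

my $dna = 'TCGTCGCCTCCCCAACCTAGGTAG';
print "frame $_: ", translate_frame($dna, $_), "\n" for 0 .. 2;
```

Other codon tables and the three frames of the opposite strand are not covered here, which is the kind of extra work the answer argues a toolkit already handles.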

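Answer 4 (score: 0)

OK,

So I asked my professor, and there were a couple of problems with my code. First, I used $codon twice while expecting it to do two different things (once in the while loop and once in the for loop). So it treated the entire infile as $codon and then ran the hash lookup on that. The second thing wrong (as others mentioned earlier) was that $protein was not inside the for loop, so it would only give me the last amino acid. Anyway, here is the corrected, working code: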
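The first problem can be reproduced in a few lines. This is a small sketch with a toy two-entry table and a made-up six-base sequence: because $codon still holds the whole input, appending three more bases produces a key that is not in the hash, and the lookup returns undef. Without "use warnings", concatenating and printing undef produces an empty string and no message, which matches the silent, empty output described in the question.

```perl
use strict;
use warnings;

# Toy subset of the codon table, for illustration only.
my %aacode = (TCG => "S", CCT => "P");

my $sequence = "TCGCCT";    # stands in for the whole input file

# Buggy pattern: $codon already contains the entire sequence.
my $codon = $sequence;
$codon = $codon . substr($sequence, 0, 3);    # now "TCGCCTTCG"
my $buggy = $aacode{$codon};                  # no such key, so undef

# Corrected pattern: the key is exactly three bases.
my $fixed = $aacode{ substr($sequence, 0, 3) };

print "buggy lookup: ", (defined($buggy) ? $buggy : "(no match)"), "\n";
print "fixed lookup: $fixed\n";
```

Adding "use strict;" and "use warnings;" to the original script would have surfaced the undefined value as a warning.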

open (INFILE, "sumaira2.out");
open (OUTFILE3, ">sumaira3.out");

%aacode = (
TTT => "F", TTC => "F", TTA => "L", TTG => "L",
TCT => "S", TCC => "S", TCA => "S", TCG => "S",
TAT => "Y", TAC => "Y", TAA => "STOP", TAG => "STOP",
TGT => "C", TGC => "C", TGA => "STOP", TGG => "W",
CTT => "L", CTC => "L", CTA => "L", CTG => "L",
CCT => "P", CCC => "P", CCA => "P", CCG => "P",
CAT => "H", CAC => "H", CAA => "Q", CAG => "Q",
CGT => "R", CGC => "R", CGA => "R", CGG => "R",
ATT => "I", ATC => "I", ATA => "I", ATG => "M",
ACT => "T", ACC => "T", ACA => "T", ACG => "T",
AAT => "N", AAC => "N", AAA => "K", AAG => "K",
AGT => "S", AGC => "S", AGA => "R", AGG => "R",
GTT => "V", GTC => "V", GTA => "V", GTG => "V",
GCT => "A", GCC => "A", GCA => "A", GCG => "A",
GAT => "D", GAC => "D", GAA => "E", GAG => "E",
GGT => "G", GGC => "G", GGA => "G", GGG => "G",
); # this is the hash table for the amino acids

while ($line=<INFILE>){
  $line =~ s/\s+$//;
  $sequence = $sequence.$line;
  @array = split "",$sequence;
} # splits all the characters in the text

for ($count = 0; $count<=scalar @array-3; $count= $count + 3) {
  $codon = $array[$count].$array[$count+1].$array[$count+2];
  $aminoacid = $aacode{$codon};
  $protein = $protein.$aminoacid; #catenate the string
} # tells how to read the codon and execute the hash table


print OUTFILE3 $protein;

Thanks again for everyone's help, and sorry it took me so long to get back!