我必须使用Perl计算多个序列中重叠二聚体(AA,AG,AC,AT,GA,GG,GC,GT,CC,CG,CA,CT,TT,TA,TG,TC)的数量。我编写了以下代码,但它只适用于一个序列。如何将其扩展为多个序列?
#!/usr/bin/perl -w
open FH, "sample.txt";
$genome=<FH>;
%count=();
$count{substr($genome, $_, 2)}++ for 0..length($genome)-2;
print "AA: $count{AA}\n";
print "AG: $count{AG}\n";
print "AC: $count{AC}\n";
print "AT: $count{AT}\n";
print "TA: $count{TA}\n";
print "TG: $count{TG}\n";
print "TC: $count{TC}\n";
print "TT: $count{TT}\n";
print "GA: $count{GA}\n";
print "GG: $count{GG}\n";
print "GC: $count{GC}\n";
print "GT: $count{GT}\n";
print "CA: $count{CA}\n";
print "CG: $count{CG}\n";
print "CC: $count{CC}\n";
print "CT: $count{CT}\n";
我需要:
输入示例:sample.txt
ATGGGCTCCTCCGCCATCACCGTGAGCTTCCTCCTCTTTCTGGCATTTCAGCTCCCAGGGCAAACAGGAGCAAATCCCGTGTATGGCTCTGTGTCCAATGCAGACCTGATGGATTTCAAGTAAAAG
ATGGTGAGAAAATGGGCCCTGCTCCTGCCCATGCTGCTCTGCGGCCTGACTGGTCCCGCACACCTCTTCCAGCCAAGCCTGGTGCTGGAGATGGCCCAGGTCCTCTTGGATAACTACTGCTTCCCAGAGAACCTGATGGGGATGCAGGGAGCCATCGAGCAGGCCATCAAAAGTCAGGAGATTCTGTCTATCTCAGACCCTCAGACTCTGGCCCATGTGTTGACAGCTGGGGTGCAGAGCTCCTTGAATGACCCTCGCCTGGTCATCTCCTATGAGCCCAGCACCCTCGAGGCCCCTCCGCGAGCTCCAGCAGTCACGAACCTCACACTAGAGGAAATCATCGCAGGGCTGCAGGATGGCCTTCGCCATGAGATTCTGGAAGGCAATGTGGGCTACCTGCGGGTGGACGACATCCCGGGCCAGGAGGTGATGAGCAAGCTGAGGAGCTTCCTGGTGGCCAACGTCTGGAGGAAGCTCGTGAACACGTCCGCCTTGGTGCTGGACCTCCGGCACTGCACTGGGGGACACGTGTCTGGCATCCCCTATGTCATCTCCTACCTGCACCCAGGGAGCACAGTCTCGCACGTGGACACCGTCTACGACCGCCCCTCCAACACAACCACTGAGATCTGGACCCTGCCTGAAGCCCTGGGAGAGAAGTACAGTGCAGACAAGGATGTGGTGGTCCTCACCAGCAGCCGCACGGGGGGCGTGGCTGAGGACATCGCTTACATCCTCAAACAGATGCGCAGGGCCATCGTGGTGGGCGAGCGGACTGTTGGGGGGGCTCTGAACCTCCAGAAGCTGAGGGTAGGCCAGTCCGACTTCTTTCTCACTGTGCCTGTGTCCAGATCCCTGGGGCCCCTGGGTGAGGGCAGCCAGACGTGGGAGGGCAGTGGGGTGCTGCCCTGTGTGGGGACACCGGCCGAGCAGGCCCTGGAGAAAGCCCTGGCCGTTCTCATGCTGCGCAGGGCCCTGCCAGGAGTCATTCAGCGCCTTCAGGAGGCGCTGCGCGAGTACTACACGCTGGTGGACCGTGTGCCCGCCCTGCTGAGCCACCTGGCCGCCATGGACCTGTCCTCGGTGGTCTCCGAGGACGATCTGGTCACTAAGCTCAATGCTGGCCTGCAGGCTGTGTCTGAGGACCCCAGGCTCCAGGTGCAGGTGGTCAGACCCAAAGAAGCCTCTTCTGGGCCTGAGGAAGAAGCTGAAGAACCTCCAGAGGCGGTCCCGGAAGTGCCCGAGGACGAGGCTGTTCGGCGGGCTCTGGTGGACTCCGTGTTCCAGGTTTCTGTGCTGCCGGGCAACGTGGGCTACCTGCGCTTCGACAGTTTCGCTGATGCCTCTGTCCTGGAGGTGCTGGGCCCCTACATCCTGCACCAGGTGTGGGAGCCCCTGCAGGACACGGAGCACCTCATCATGGACCTGCGGCAGAACCCCGGGGGGCCGTCCTCCGCGGTGCCCCTGCTGCTCTCCTACTTCCAGAGCCCTGACGCCAGCCCCGTGCGCCTCTTCTCCACCTACGACCGGCGCACCAACATCACACGCGAGCACTTCAGCCAGACGGAGCTGCTGGGCCGGCCCTACGGCACCCAGCGTGGCGTGTACCTGCTCACTAGCCACCGCACCGCCACCGCGGCCGAGGAGCTGGCCTTCCTCATGCAGTCACTGGGCTGGGCCACGCTGGTGGGCGAGATCACCGCGGGCAGCCTGCTGCACACACACACAGTATCCCTGCTGGAGACGCCCGAGGGCGGCCTGGCGCTCACGGTGCCTGTGCTCACCTTCATCGACAACCATGGCGAGTGCTGGCTGGGGGGCGGTGTGGTCCCCGATGCCATTGTGCTGGCCGAGGAAGCCCTAGACAGAGCCCAGGAGGTGCTGGAGTTCCACCGAAGCTTGGGGGAGTTGGTGGAAGGCACGGGGCGCCTGCTGGAGGCCCACTACGCTCGGCCAGAGGTCGTGGGGCAGATGGGTGCCCTGCTGCGAGCCAAGCTGGCCCAGGGGGCCTACCGCACCGCGGTGGACCTGGAGTCGCTGGCTTCCCAGCTTACGGCCGACCTGCAGGAGATGTCTGGGGACCACCGTCTGCTGGTGTTCCACAGCCCCGGCGAAATGGTGGCTGAGGAGGCGCCCCCACCGCCTCCCGTCGTCCCCTCCCCGGAGGAGCTGTCCTATCTCATCGAGGCCCTGTTCAAGACTGAGGTGCTGCCCGGCCAGCTGGGCTACCTGCGTTTCGACGCCATGGCTGAGCTGGAGACGGTGAAGGCCGTCGGGCCACAGCTGGTGCAGCTGGTGTGGCAGAAGCTGGTGGACACGGCCGCGCTGGTGGTCGACCTGCGCTACAACCCCGGCAGCTACTCCACAGCCGTGCCTCTACTCTGCTCCTACTTCTTCGAGGCAGAGCCCCGCCGGCACCTCTACTCTGTCTTTGACAGGGCCACGTCAAGGGTCACAGAGGTATGGACCCTGCCCCACGTTACAGGCCAGCGCTATGGCTCCCACAAGGACCTCTACGTTCTGGTGAGCCACACCAGCGGTTCAGCAGCTGAGGCTTTTGCTCACACCATGCAGGATCTGCAGCGAGCCACCATCATCGGGGAGCCCACGGCCGGAGGGGCACTCTCCGTGGGAATCTACCAGGTGGGCAGCAGCGCCTTATACGCCTCCATGCCCACGCAGATGGCCATGAGTGCCAGCACCGGCGAGGCCTGGGATCTGGCTGGGGTGGAGCCGGACATCACTGTGCCCATGAGCGTGGCCCTCTCCACAGCCCGGGACATAGTGACCCTGCGTGCCAAGGTGCCCACTGTGCTGCAGACAGCTGGGAAGCTCGTAGCGGATAACTACGCCTCCCCTGAGCTGGGAGTCAAGATGGCAGCCGAACTGAGTGGTCTGCAGAGCCGCTATGCCAGGGTGACCTCAGAAGCCGCCCTCGCCGAGCTGCTGCAAGCCGACCTGCAGGTGCTGTCCGGGGACCCACACCTGAAGACAGCTCATATACCTGAGGATGCCAAAGACCGCATTCCTGGCATTGTACCCATGCAGTAACAG
ATGGACATGATGGACGGCTGCCAGTTCTCGCCCTCTGAGTACTTCTACGACGGCTCCTGCATCCCATCCCCCGACGGTGAGTTCGGGGACGAGTTTGAGCCGCGAGTGGCTGCTTTCGGGGCTCACAAGGCAGACCTGCAAGGCTCAGACGAGGACGAGCACGTGCGAGCACCCACGGGCCACCACCAGGCCGGCCACTGCCTCATGTGGGCCTGCAAAGCATGCAAAAGGAAGTCCACCACCATGGATCGGCGGAAGGCGGCCACCATGCGCGAGCGGAGACGCCTGAAGAAGGTCAACCAGGCTTTCGACACGCTCAAGCGGTGCACCACGACCAACCCTAACCAGAGGCTGCCCAAGGTGGAGATCCTCAGGAATGCCATCCGCTACATTGAGAGTCTGCAGGAGCTGCTTAGGGAACAGGTGGAAAACTACTATAGCCTGCCGGGGCAGAGCTGCTCTGAGCCCACCAGCCCCACCTCAAGTTGCTCTGATGGCATGTAAATG
答案 0 :(得分:1)
#!/usr/bin/perl
use strict;
my @dimers = qw(AA AG AC AT GA GG GC GT CC CG CA CT TT TA TG TC);
my @dimers_totals;
my $i = 1;
for(<>)
{
my $sequence = $_;
print("Sequence $i:\n");
my $j = 0;
for(@dimers)
{
my $number =()= $sequence =~ /$_/gi;
$dimers_totals[$j++] += $number;
print "\t$_: $number\n"
}
print("\n");
$i++;
}
print("Totals:\n");
$i = 0;
for(@dimers)
{
print("\t$_: $dimers_totals[$i++]\n");
}
使用
./dimers_count.pl < sample.txt
答案 1 :(得分:0)
关键是要把你所拥有的东西作为你想要构建的功能单元。您的代码适用于一个序列;将其扩展到多个序列只是摆脱只考虑一个序列的假设:
use strict;
use warnings;
use 5.010;
use Data::Dumper ();
sub count_dimers {
my ($sequence) = @_;
my %counts;
$counts{substr($sequence, $_, 2)}++ for 0..length($sequence) - 2;
my @dimers = qw(AA AG AC AT GA GG GC GT CC CG CA CT TT TA TG TC);
%counts = map { $_ => $counts{$_} } @dimers;
return %counts;
}
open(my $fh, '<', 'sample.txt');
my @counts_by_sequence;
while (my $sequence = <$fh>) {
my %sequence_counts = count_dimers($sequence);
push @counts_by_sequence, \%sequence_counts;
}
my %total_counts;
for my $sequence_counts (@counts_by_sequence) {
for my $dimer (keys %$sequence_counts) {
$total_counts{$dimer} += ${ $sequence_counts}{$dimer};
}
}
say Data::Dumper->Dump(
[\%total_counts, \@counts_by_sequence],
[qw(total_count counts_by_sequence)]
);
我将输出作为练习留给读者,但转换的一般形状应该是明显的:一旦整个程序现在是一个函数,每个序列调用一次,每个调用存储的结果和所有通话的结果总计。
答案 2 :(得分:0)
我的第一个想法是在标量上下文中进行全局匹配,并调整为pos()来备份一个。这样,我只需要扫描一次所有二聚体的字符串(而其他答案到目前为止每个二聚体扫描一次)。我保持着哈希;一个用于总数,一个用于序列。我正在使用保存当前输入行号的$.
特殊变量来标记序列:
use Data::Printer;
while( <DATA> ) {
while( /\G([ATGC]{2})/g ) {
$total{$1}++;
$by_sequence{$.}{$1}++;
pos() = pos() - 1
}
}
p( %total );
p( %by_sequence );
对substr做同样的事情需要做更多的工作,因为我无法将子字符串限制为重要的字符。最后我不得不注意新行和奇数:
use Data::Printer;
LINE: while( <DATA> ) {
chomp;
my $pos = 0;
SUB: while( my $sub = substr( $_, $pos, 2 ) ) {
last SUB unless 2 == length $sub;
$total{$sub}++;
$by_sequence{$.}{$sub}++;
$pos++;
}
}
p( %total );
p( %by_sequence );
Data::Printer的输出非常好。自从我在YAPC :: Brasil发现它以来,我一直在使用它而不是Data::Dumper。 (顺便说一下,它的作者今年将在麦迪逊的YAPC :: NA to talk about it):
{
AA 101,
AC 215,
AG 268,
AT 106,
CA 286,
CC 388,
CG 201,
CT 310,
GA 239,
GC 376,
GG 369,
GT 168,
TA 61,
TC 206,
TG 317,
TT 73
}
{
1 {
AA 9,
AC 3,
AG 8,
AT 8,
CA 11,
CC 12,
CG 3,
CT 11,
GA 5,
GC 9,
GG 8,
GT 6,
TA 2,
TC 13,
TG 10,
TT 7
},
2 {
AA 68,
AC 177,
AG 219,
AT 80,
CA 230,
CC 329,
CG 168,
CT 264,
GA 195,
GC 316,
GG 317,
GT 146,
TA 50,
TC 169,
TG 271,
TT 54
},
3 {
AA 24,
AC 35,
AG 41,
AT 18,
CA 45,
CC 47,
CG 30,
CT 35,
GA 39,
GC 51,
GG 44,
GT 16,
TA 9,
TC 24,
TG 36,
TT 12
}
}