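I am trying to perform composition-based filtering on a large number of strings (protein sequences).
I wrote a set of three subroutines to handle it, but I am running into trouble in two ways, one minor and one major. The minor trouble is that when I use List::MoreUtils 'pairwise', I get warnings about $a and $b being used only once and being uninitialized. But I believe I am calling this function correctly (based on its CPAN entry and some examples from the web).
The major problem is the error "Can't use string ("17/32") as HASH ref while "strict refs" in use..."
It seems this only happens when the foreach loop in &comp gives the hash values as a string instead of evaluating the division. I am sure I have made a rookie mistake, but I can't find the answer online. The first time I ever looked at Perl code was last Wednesday...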
use List::Util;
use List::MoreUtils;
my @alphabet = (
    'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
    'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'
);
my $gapchr = '-';
# Takes a sequence and returns letter => occurrence count pairs as hash.
sub getcounts {
    my %counts = ();
    foreach my $chr (@alphabet) {
        $counts{$chr} = ( $_[0] =~ tr/$chr/$chr/ );
    }
    $counts{'gap'} = ( $_[0] =~ tr/$gapchr/$gapchr/ );
    return %counts;
}
# Takes a sequence and returns letter => fractional composition pairs as a hash.
sub comp {
    my %comp = getcounts( $_[0] );
    foreach my $chr (@alphabet) {
        $comp{$chr} = $comp{$chr} / ( length( $_[0] ) - $comp{'gap'} );
    }
    return %comp;
}
# Takes two sequences and returns a measure of the composition difference between them, as a scalar.
# Originally all on one line but it was unreadable.
sub dcomp {
    my @dcomp = pairwise { $a - $b } @{ values( %{ comp( $_[0] ) } ) }, @{ values( %{ comp( $_[1] ) } ) };
    @dcomp = apply { $_ ** 2 } @dcomp;
    my $dcomp = sqrt( sum( 0, @dcomp ) ) / 20;
    return $dcomp;
}
Any answers or advice are greatly appreciated!
Answer 0 (score: 4)
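There are a few errors in your code. First, note from perldoc perlop:
"Because the transliteration table is built at compile time, neither the SEARCHLIST nor the REPLACEMENTLIST are subjected to double quote interpolation."
So your counting method is incorrect: tr/$chr/$chr/ counts the literal characters '$', 'c', 'h' and 'r', not the character stored in $chr. I also believe you are misusing pairwise. It is hard to assess what correct usage would be, because you have not given examples of the output you expect for some simple inputs.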
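To make the tr point concrete, here is a minimal sketch that contrasts tr with a variable in the search list against a regex-based count. The helper name count_char is hypothetical and not part of the original script; the regex count is one possible replacement, not necessarily the one the asker should adopt.

```perl
use strict;
use warnings;

# Count occurrences of a single character in a sequence.
# count_char is a hypothetical helper name used only for this illustration.
sub count_char {
    my ($seq, $chr) = @_;
    # \Q...\E quotes regex metacharacters; "() =" forces list context,
    # so the outer scalar assignment yields the number of matches.
    my $count = () = $seq =~ /\Q$chr\E/g;
    return $count;
}

my $seq = 'AAC-A';
my $chr = 'A';

# tr/// does not interpolate: the search list here is the literal
# characters '$', 'c', 'h' and 'r', none of which occur in $seq.
my $literal = ( $seq =~ tr/$chr/$chr/ );

print "tr with a variable: $literal\n";                      # 0
print "regex count:        ", count_char($seq, $chr), "\n";  # 3
```

perlop also describes wrapping tr in a string eval when the character list is only known at run time; the regex count avoids the eval at the cost of some speed.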
In any case, this is how I would rewrite the script (with a few debugging statements sprinkled in):
#!/usr/bin/perl
use List::AllUtils qw( sum );
use YAML;
our ($a, $b);
my @alphabet = ('A' .. 'Z');
my $gap = '-';
my $seq1 = 'ABCD-EFGH--MNOP';
my $seq2 = 'EFGH-ZZZH-KLMN';
print composition_difference($seq1, $seq2);
sub getcounts {
    my ($seq) = @_;
    my %counts;
    my $pattern = join '|', @alphabet, $gap;
    $counts{$1} ++ while $seq =~ /($pattern)/g;
    warn Dump \%counts;
    return \%counts;
}
sub fractional_composition_pairs {
    my ($seq) = @_;
    my $comp = getcounts( $seq );
    # Parentheses matter: "length $seq - $x" parses as length($seq - $x).
    my $denom = length($seq) - $comp->{$gap};
    $comp->{$_} /= $denom for @alphabet;
    warn Dump $comp;
    return $comp;
}
sub composition_difference {
    # I think your use of pairwise in the original script
    # is very buggy unless every sequence always contains
    # all the letters in the alphabet and the gap character.
    # Is the gap character supposed to factor in the computations here?
    my ($comp1, $comp2) = map { fractional_composition_pairs($_) } @_;
    my %union;
    ++ $union{$_} for (keys %$comp1, keys %$comp2);
    my $dcomp;
    {
        no warnings 'uninitialized';
        $dcomp = sum map {
            ($comp1->{$_} - $comp2->{$_}) ** 2
        } keys %union;
    }
    return sqrt( $dcomp ) / 20; # where did 20 come from?
}
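Answer 1 (score: 2)
%{ $foo } treats $foo as a hash reference and dereferences it; similarly, @{} dereferences array references. Since comp returns a hash as a list (hashes become lists when passed to and from functions) rather than a hash reference, the %{} is wrong. You could drop the %{}, but values is a special form that needs a hash, not a hash passed as a list. To pass the result of comp to values, comp would need to return a hash reference, which is then dereferenced.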
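As a rough illustration of that last point, the sketch below returns a hash reference from a subroutine and dereferences it for values. The subroutine letter_counts is a hypothetical stand-in for comp, not the asker's code.

```perl
use strict;
use warnings;

# letter_counts is a hypothetical stand-in for comp; it returns a hash reference.
sub letter_counts {
    my ($seq) = @_;
    my %counts;
    $counts{$_}++ for split //, $seq;
    return \%counts;    # a reference, so the caller can dereference it
}

my $counts = letter_counts('AAB');

# %{ } dereferences the hash reference, giving values a real hash to work on.
# The values are sorted here only because values returns them in no fixed order.
my @values = sort { $a <=> $b } values %{$counts};

print "@values\n";    # 1 2
```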
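There is another problem with your dcomp: values returns its results (as the documentation puts it) "in an apparently random order", so the values passed to the pairwise block are not necessarily for the same character. Instead of values, you can use hash slices. With that approach we are back to comp returning a hash (as a list).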
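sub dcomp {
    my %ahisto = comp($_[0]);
    my %bhisto = comp($_[1]);
    my @keys = uniq keys %ahisto, keys %bhisto;
    # pairwise needs real arrays (its prototype rejects hash slices),
    # so copy the slices into arrays first.
    my @avals = @ahisto{@keys};
    my @bvals = @bhisto{@keys};
    my @dcomp = pairwise { $a - $b } @avals, @bvals;
    @dcomp = apply { $_ ** 2 } @dcomp;
    my $dcomp = sqrt( sum( 0, @dcomp ) ) / 20;
    return $dcomp;
}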
This does not address what happens if a character appears in only one of $_[0] and $_[1].
uniq is left as an exercise for the reader.
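For readers who would rather skip the exercise, here is one common way to write uniq by hand, as a minimal sketch. List::MoreUtils also exports a uniq function, so importing that is an alternative to defining your own.

```perl
use strict;
use warnings;

# A hand-rolled uniq: keeps the first occurrence of each item, in input order.
sub uniq {
    my %seen;
    # $seen{$_}++ is false the first time an item is seen and true afterwards.
    return grep { !$seen{$_}++ } @_;
}

my @keys = uniq( qw(A R N), qw(R N gap) );
print "@keys\n";    # A R N gap
```

Because the order of the first occurrences is preserved, the same @keys list can be used to slice both hashes and the values will line up.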
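Answer 2 (score: 2)
Just going through the code you provided, this is how I would have written it. I don't know whether it will work the way you want it to.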
use strict;
use warnings;
our( $a, $b );
use List::Util qw( sum );                  # import the functions actually used
use List::MoreUtils qw( pairwise apply );
my @alphabet = split '', 'ARNDCQEGHILKMFPSTWYV';
my $gapchr = '-';
# Takes a sequence and returns letter => occurrence count pairs as hash.
sub getcounts {
    my( $sequence ) = @_;
    my %counts;
    for my $chr (@alphabet) {
        $counts{$chr} = () = $sequence =~ /($chr)/g;
        # () = forces list context
    }
    $counts{'gap'} = () = $sequence =~ /($gapchr)/g;
    return %counts if wantarray; # list context
    return \%counts; # scalar context
    # which is what happens inside of %{ }
}
# Takes a sequence and returns letter => fractional composition pairs as a hash
sub comp {
    my( $sequence ) = @_;
    my %counts = getcounts( $sequence );
    my %comp;
    for my $chr (@alphabet) {
        # divide the count (from %counts), not the still-empty %comp entry
        $comp{$chr} = $counts{$chr} / ( length( $sequence ) - $counts{'gap'} );
    }
    return %comp if wantarray; # list context
    return \%comp; # scalar context
}
# Takes two sequences and returns a measure of the composition difference
# between them, as a scalar.
sub dcomp {
    my( $seq1, $seq2 ) = @_;
    my @dcomp = pairwise { $a - $b }
        @{[ values( %{ comp( $seq1 ) } ) ]},
        @{[ values( %{ comp( $seq2 ) } ) ]};
    # @{[ ]} makes a list into an array reference, then dereferences it.
    # values always returns a list
    # a list, or array in scalar context, returns the number of elements
    # ${ } @{ } and %{ } forces their contents into scalar context
    @dcomp = apply { $_ ** 2 } @dcomp;
    my $dcomp = sqrt( sum( 0, @dcomp ) ) / 20;
    return $dcomp;
}
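One of the most important things you need to know is the difference between scalar, list, and void context. This is because everything behaves differently in different contexts.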
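A small sketch of how context changes behavior, using wantarray as in the subroutines above. The helper name context_of is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# context_of is a hypothetical helper that reports how it was called.
sub context_of {
    return 'list'   if wantarray;           # true in list context
    return 'scalar' if defined wantarray;   # false but defined in scalar context
    return 'void';                          # undef in void context
}

my @list   = context_of();    # assignment to an array gives list context
my $scalar = context_of();    # assignment to a scalar gives scalar context

my @letters = ('A', 'R', 'N');
my $count   = @letters;       # an array in scalar context yields its length

print "$list[0] $scalar $count\n";    # list scalar 3
```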
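Answer 3 (score: 1)
Re: the minor problem
Your usage is fine; this is a common issue with (some of) the functions in the List::Util and List::MoreUtils modules.
One way to remove the warnings is simply to declare those special variables beforehand, like so: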
our ($a, $b);
Another is to put this before the call to pairwise:
no warnings 'once';
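A minimal sketch of the second approach, with the pragma confined to a single subroutine. It uses reduce from the core List::Util module as a stand-in for pairwise (an assumption made so the example has no CPAN dependency); both functions set $a and $b for their block. The helper name total is hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# total is a hypothetical helper name used only for this illustration.
sub total {
    my @nums = @_;
    # Limit the suppression to this block rather than the whole file.
    no warnings 'once';
    return reduce { $a + $b } 0, @nums;
}

print total(1, 2, 3), "\n";    # 6
```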
For more information about $a and $b, see perlvar.
/I3az/