My HoHoA (hash of hashes of arrays) is set up as follows:
#!/usr/bin/perl
use warnings;
use strict;
my %experiment = (
'gene1' => {
'condition2' => ['XLOC_000347','80', '0.5'],
'condition3' => ['XLOC_000100', '50', '0.2']
},
'gene2' => {
'condition1' => ['XLOC_025437', '100', '0.018'],
'condition2' => ['XLOC_000322', '77', '0.22'],
'condition3' => ['XLOC_001000', '43', '0.02']
},
'gene3' => {
'condition1' => ['XLOC_025437', '100', '0.018'],
'condition3' => ['XLOC_001045', '23', '0.0001']
},
'gene4' => {
'condition3' => ['XLOC_091345', '93', '0.005']
}
);
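I am trying to find all 'genes' that occur in at least 2 conditions and, for each such gene, print the condition with the lowest value (e.g. the q_value). I then want to sort by that value. This is my code so far:
Loop over the first-level keys (genes) and find every gene that has at least 2 second-level keys (conditions).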
my(%overlap, %condition_name);
my ($xloc, $q_val, $percentage, %seen);
for my $gene (sort keys %experiment) {
for my $condition (sort keys %{$experiment{$gene}}) {
$condition_name{$condition} = 1;
$seen{$gene}++; # Counts for each occurrence of gene
$overlap{$gene} = 1 if $seen{$gene} > 1;
}
}
For each overlapping gene, print every condition (key2) in which the gene (key1) is found, together with the associated values:
my @cond_name = keys %condition_name;
foreach my $gene (keys %overlap){
foreach my $condition (@cond_name){
next unless exists $experiment{$gene}{$condition};
($xloc, $percentage, $q_val) = @{$experiment{$gene}{$condition}};
print "$condition\t$gene\t$xloc\t$q_val\t$percentage\n";
}
print "\n";
}
Output:
condition3 gene3 XLOC_001045 0.0001 23
condition1 gene3 XLOC_025437 0.018 100
condition3 gene1 XLOC_000100 0.2 50
condition2 gene1 XLOC_000347 0.5 80
condition3 gene2 XLOC_001000 0.02 43
condition1 gene2 XLOC_025437 0.018 100
condition2 gene2 XLOC_000322 0.22 77
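I am trying to change the output in two ways. First, for each gene (for example gene1), I want to compare the q_value of its conditions (condition3 and condition2 for gene1) and keep only the lowest one. Desired output: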
condition3 gene3 XLOC_001045 0.0001 23
condition3 gene1 XLOC_000100 0.2 50
condition1 gene2 XLOC_025437 0.018 100
Second, I want to sort the genes by that value. Desired final output (see the update below):
condition3 gene3 XLOC_001045 0.0001 23
condition1 gene2 XLOC_025437 0.018 100
condition3 gene1 XLOC_000100 0.2 50
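Update (16 September 2013):
I have started a bounty on this question because the answers, although good, do not quite do what I was hoping for. Please let me know if the question needs clarifying.
My final desired output has also changed slightly. As above, I want to compare one of the values across each gene's conditions and sort the genes by that value. Ideally, I would like to print every condition for each sorted gene, with the conditions themselves sorted on the same value: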
condition3 gene3 XLOC_001045 0.0001 23 # Lowest q_value
condition1 gene3 XLOC_025437 0.018 100 # Other condition(s) for the gene with lowest q_value...
condition1 gene2 XLOC_025437 0.018 100 # For each gene, rank by q_value
condition3 gene2 XLOC_001000 0.02 43
condition2 gene2 XLOC_000322 0.22 77
condition3 gene1 XLOC_000100 0.2 50
condition2 gene1 XLOC_000347 0.5 80
Answer 0 (score: 2)
Your code is overly complicated. To get the list of genes with two or more conditions, use grep:
my @genes = grep { keys %{$experiment{$_}} >= 2 } keys %experiment;
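The grep works because keys returns the number of keys when it is used in numeric context. A minimal sketch of that behaviour, using a cut-down copy of the question's hash (the two-gene subset is only for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Cut-down copy of %experiment from the question (illustrative subset).
my %experiment = (
    gene1 => {
        condition2 => ['XLOC_000347', '80', '0.5'],
        condition3 => ['XLOC_000100', '50', '0.2'],
    },
    gene4 => {
        condition3 => ['XLOC_091345', '93', '0.005'],
    },
);

# In a numeric comparison, "keys %hash" evaluates to the number of keys,
# so this keeps only the genes that have at least two conditions.
my @genes = grep { keys %{ $experiment{$_} } >= 2 } keys %experiment;
@genes = sort @genes;    # fixed order, because hash order is not stable

print "@genes\n";        # prints "gene1"
```

gene4 is dropped because its inner hash has a single key.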
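Next, we need to sort the genes on their minimum q_value. The simplest way (though not the most efficient) is to find the minimum for each gene first and store it in a hash: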
use List::Util qw(min);
my %minimum;
foreach my $gene (@genes) {
my @q_vals;
push @q_vals, $experiment{$gene}{$_}[2] for keys %{$experiment{$gene}};
$minimum{$gene} = min @q_vals;
}
Once we have all the minimum values, we can sort the genes on them:
@genes = sort { $minimum{$a} <=> $minimum{$b} } keys %minimum;
Now we only need to sort the conditions within each gene and print the values:
foreach my $gene (@genes) {
# Sort conditions on the field at index 2 (the q_value, counting from 0)
my @conditions = sort { $experiment{$gene}{$a}[2] <=> $experiment{$gene}{$b}[2] } keys %{$experiment{$gene}};
foreach my $condition (@conditions) {
my ($xloc, $percentage, $q_val) = @{$experiment{$gene}{$condition}};
print "$condition\t$gene\t$xloc\t$q_val\t$percentage\n";
}
print "\n";
}
Updated output:
condition3 gene3 XLOC_001045 0.0001 23
condition1 gene3 XLOC_025437 0.018 100
condition1 gene2 XLOC_025437 0.018 100
condition3 gene2 XLOC_001000 0.02 43
condition2 gene2 XLOC_000322 0.22 77
condition3 gene1 XLOC_000100 0.2 50
condition2 gene1 XLOC_000347 0.5 80
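This is not a very efficient approach, because we traverse the hash several times. You may want to consider changing your data structure to something more manageable.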
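As one possible reading of "more manageable", here is a rough sketch that stores one record per gene/condition pair. The flat @records layout and the names %by_gene and %min_q are my own assumptions, not something from the question, and only part of the data is reproduced:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical flat layout: one hash ref per gene/condition pair.
my @records = (
    { gene => 'gene1', condition => 'condition2', xloc => 'XLOC_000347', percentage => '80',  q_val => '0.5'    },
    { gene => 'gene1', condition => 'condition3', xloc => 'XLOC_000100', percentage => '50',  q_val => '0.2'    },
    { gene => 'gene3', condition => 'condition1', xloc => 'XLOC_025437', percentage => '100', q_val => '0.018'  },
    { gene => 'gene3', condition => 'condition3', xloc => 'XLOC_001045', percentage => '23',  q_val => '0.0001' },
    { gene => 'gene4', condition => 'condition3', xloc => 'XLOC_091345', percentage => '93',  q_val => '0.005'  },
);

# Group the records by gene in a single pass.
my %by_gene;
push @{ $by_gene{ $_->{gene} } }, $_ for @records;

# Keep genes with at least two records and note each gene's smallest q_val.
my %min_q;
for my $gene (keys %by_gene) {
    next if @{ $by_gene{$gene} } < 2;
    $min_q{$gene} = min map { $_->{q_val} } @{ $by_gene{$gene} };
}

# Genes ordered by minimum q_val; records within a gene ordered by q_val.
my @lines;
for my $gene (sort { $min_q{$a} <=> $min_q{$b} or $a cmp $b } keys %min_q) {
    for my $rec (sort { $a->{q_val} <=> $b->{q_val} } @{ $by_gene{$gene} }) {
        push @lines, join "\t", @{$rec}{qw(condition gene xloc q_val percentage)};
    }
}
print "$_\n" for @lines;
```

Named fields such as q_val replace the positional index [2], which makes the sort blocks easier to read. Whether this layout is worth the conversion depends on how the data is produced upstream.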
Answer 1 (score: 1)
Build another HoH instead of printing. Then iterate over it at the end and decide what to print.
So at the top, add:
my %lowest;
my $sortkey='q_val';
In the middle, make this edit:
foreach my $condition (@cond_name){
next unless exists $experiment{$gene}{$condition};
## ($xloc, $percentage, $q_val) = @{$experiment{$gene}{$condition}};
## print "$condition\t$gene\t$xloc\t$q_val\t$percentage\n";
my %cond;
@cond{ qw( xloc percentage q_val ) } = @{$experiment{$gene}{$condition}};
# >= may also be appropriate
if (!defined($lowest{$gene}) or $lowest{$gene}{$sortkey} > $cond{$sortkey}) {
@cond{ qw( condition gene ) } = ($condition, $gene); # useful at print time
$lowest{$gene} = \%cond
}
}
## print "\n";
Finally:
# NB: <=> is for numeric comparison. Use cmp for non-numeric keys.
for my $gene (sort { $lowest{$a}{$sortkey} <=> $lowest{$b}{$sortkey} } keys %lowest) {
local ($, , $\)=("\t","\n");
print @{$lowest{$gene}}{qw( condition gene xloc q_val percentage )};
}
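For readers who want to try the idea without splicing it into the question's script, here is a self-contained sketch under one assumption of mine: the "at least two conditions" filter is folded into the loop rather than taken from the %overlap hash. It relies on the same hash-slice assignment and keeps only the lowest-q_val condition per gene:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %experiment = (
    gene1 => { condition2 => ['XLOC_000347', '80', '0.5'],
               condition3 => ['XLOC_000100', '50', '0.2'] },
    gene2 => { condition1 => ['XLOC_025437', '100', '0.018'],
               condition2 => ['XLOC_000322', '77', '0.22'],
               condition3 => ['XLOC_001000', '43', '0.02'] },
    gene3 => { condition1 => ['XLOC_025437', '100', '0.018'],
               condition3 => ['XLOC_001045', '23', '0.0001'] },
    gene4 => { condition3 => ['XLOC_091345', '93', '0.005'] },
);

my $sortkey = 'q_val';
my %lowest;

for my $gene (keys %experiment) {
    next if keys %{ $experiment{$gene} } < 2;    # skip single-condition genes
    for my $condition (keys %{ $experiment{$gene} }) {
        my %cond;
        # Hash slice: name the three array elements in one assignment.
        @cond{qw(xloc percentage q_val)} = @{ $experiment{$gene}{$condition} };
        @cond{qw(condition gene)}        = ($condition, $gene);
        # Keep this record only if it beats the best one seen so far.
        $lowest{$gene} = \%cond
            if !defined $lowest{$gene}
            or $cond{$sortkey} < $lowest{$gene}{$sortkey};
    }
}

# One row per gene, ordered by the winning q_val.
my @rows = map  { join "\t", @{ $lowest{$_} }{qw(condition gene xloc q_val percentage)} }
           sort { $lowest{$a}{$sortkey} <=> $lowest{$b}{$sortkey} }
           keys %lowest;
print "$_\n" for @rows;
```

This prints one line per gene (gene3, gene2, gene1), which matches the first desired output in the question and not the updated one that lists every condition.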
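Answer 2 (score: 1)
I have tested this code and simplified it as much as I could. I started from the original code you posted, and I hope I have not missed any of your later changes. Comments are inline: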
#!/usr/bin/perl
use warnings;
use strict;
my %experiment = (
'gene1' => {
'condition2' => ['XLOC_000347','80', '0.5'],
'condition3' => ['XLOC_000100', '50', '0.2']
},
'gene2' => {
'condition1' => ['XLOC_025437', '100', '0.018'],
'condition2' => ['XLOC_000322', '77', '0.22'],
'condition3' => ['XLOC_001000', '43', '0.02']
},
'gene3' => {
'condition1' => ['XLOC_025437', '100', '0.018'],
'condition3' => ['XLOC_001045', '23', '0.0001']
},
'gene4' => {
'condition3' => ['XLOC_091345', '93', '0.005']
}
);
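# I moved some of your "my" declarations to where they are needed. I also
# created a %qvals hash, which I use instead of your %overlap hash. It holds
# the minimum q-value for each gene, which makes sorting easier.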
my (%qvals, %seen);
for my $gene (sort keys %experiment) {
for my $condition (sort keys %{$experiment{$gene}}) {
$seen{$gene}++; # Counts for each occurrence of gene
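# Now we need to build the %qvals hash. It will be the first (outer) sort key
# for the printed output. For the first condition of each gene we save that
# condition's q-value; for later conditions we save the value only if it is
# smaller.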
if ((not exists $qvals{$gene}) || # First time we've seen this gene, OR
($qvals{$gene} > $experiment{$gene}{$condition}[2])) { # Has a smaller q value
$qvals{$gene} = $experiment{$gene}{$condition}[2];
}
}
}
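# You may not be familiar with this sort syntax. The code in braces {} is the
# "sort block". You can learn more by running perldoc -f sort in a terminal or
# by searching for it online. It lets you do quite complex things, but here we
# only use it to sort on something other than the items being sorted: we order
# the genes (keys %qvals) by their minimum q-value, $qvals{$a}. sort sets $a
# and $b automatically, so there is no need to declare them. <=> is the
# "spaceship" operator, a three-way numeric comparison that returns 0 if the
# operands are equal, -1 if the left operand is smaller, and +1 if the left
# operand is larger. Note that when no sort block is given, sort compares
# strings, as if you had written { $a cmp $b }, so a numeric sort needs an
# explicit { $a <=> $b } block.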
foreach my $gene (sort {$qvals{$a} <=> $qvals{$b}} keys %qvals) { # Sort the genes on ascending minimum q-val
if ($seen{$gene} == 1) {next;} # Skip gene if it only has one condition
foreach my $condition (sort # Sort conditions on ascending q-val
# Another complex sort, this time over the conditions of each gene: we sort
# the gene's conditions (keys %{ $experiment{$gene} }) on their q-values
# ($experiment{$gene}{$a}[2]).
{$experiment{$gene}{$a}[2] <=> $experiment{$gene}{$b}[2]}
keys %{ $experiment{$gene} } ) {
next unless exists $experiment{$gene}{$condition};
my ($xloc, $percentage, $q_val) = @{$experiment{$gene}{$condition}};
print "$condition\t$gene\t$xloc\t$q_val\t$percentage\n";
}
print "\n";
}
I get the following output:
condition3 gene3 XLOC_001045 0.0001 23
condition1 gene3 XLOC_025437 0.018 100
condition1 gene2 XLOC_025437 0.018 100
condition3 gene2 XLOC_001000 0.02 43
condition2 gene2 XLOC_000322 0.22 77
condition3 gene1 XLOC_000100 0.2 50
condition2 gene1 XLOC_000347 0.5 80
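To make the difference between the default sort and a numeric sort block concrete, here is a small sketch using the percentage values from the question's data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percentage values taken from the question's data.
my @percentages = ('80', '50', '100', '77', '43', '23', '93');

# Without a block, sort compares strings, as if { $a cmp $b } were given.
my @as_strings = sort @percentages;

# With the spaceship operator, the same values are compared as numbers.
my @as_numbers = sort { $a <=> $b } @percentages;

print "string order:  @as_strings\n";    # 100 23 43 50 77 80 93
print "numeric order: @as_numbers\n";    # 23 43 50 77 80 93 100

# The three possible results of <=>.
my @spaceship = (1 <=> 2, 2 <=> 2, 3 <=> 2);
print "spaceship:     @spaceship\n";     # -1 0 1
```

The string sort puts 100 first because the character "1" sorts before "2", which is why the q-value sorts above all use an explicit <=> block.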
Answer 3 (score: 0)
use 5.10.0;
$, = "\t";
do {
my ( $gene, @conds ) = @$_;
say( ( $gene, @$_ )[ 1, 0, 2, 4, 3 ] ) for @conds;
say "";
}
for (
sort { $a->[1][3] <=> $b->[1][3] }
map {
my $conds = $experiment{$_};
[ $_ => sort { $a->[3] <=> $b->[3] }
map { [ $_, @{ $conds->{$_} } ] }
keys %$conds
]
} grep { keys %{ $experiment{$_} } > 1 } keys %experiment
);
This outputs:
condition3 gene3 XLOC_001045 0.0001 23
condition1 gene3 XLOC_025437 0.018 100
condition1 gene2 XLOC_025437 0.018 100
condition3 gene2 XLOC_001000 0.02 43
condition2 gene2 XLOC_000322 0.22 77
condition3 gene1 XLOC_000100 0.2 50
condition2 gene1 XLOC_000347 0.5 80
Or:
use 5.10.0;
$, = "\t";
say @$_[ 1, 0, 2, 4, 3 ]
for (
map {
my ( $gene, @conds ) = @$_;
map { [ $gene, @$_ ] } @conds
}
sort { $a->[1][3] <=> $b->[1][3] }
map {
my $conds = $experiment{$_};
[ $_ => sort { $a->[3] <=> $b->[3] }
map { [ $_, @{ $conds->{$_} } ] }
keys %$conds
]
} grep { keys %{ $experiment{$_} } > 1 } keys %experiment
);
This outputs:
condition3 gene3 XLOC_001045 0.0001 23
condition1 gene3 XLOC_025437 0.018 100
condition1 gene2 XLOC_025437 0.018 100
condition3 gene2 XLOC_001000 0.02 43
condition2 gene2 XLOC_000322 0.22 77
condition3 gene1 XLOC_000100 0.2 50
condition2 gene1 XLOC_000347 0.5 80
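Both pipelines read from bottom to top: filter the genes, build a sorted row list per gene, sort the genes on their first row, then print. The sketch below unrolls the same steps into named intermediate variables (@multi, @per_gene, @ordered and @lines are names I made up for illustration) and collects the lines in an array before printing:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %experiment = (
    gene1 => { condition2 => ['XLOC_000347', '80', '0.5'],
               condition3 => ['XLOC_000100', '50', '0.2'] },
    gene2 => { condition1 => ['XLOC_025437', '100', '0.018'],
               condition2 => ['XLOC_000322', '77', '0.22'],
               condition3 => ['XLOC_001000', '43', '0.02'] },
    gene3 => { condition1 => ['XLOC_025437', '100', '0.018'],
               condition3 => ['XLOC_001045', '23', '0.0001'] },
    gene4 => { condition3 => ['XLOC_091345', '93', '0.005'] },
);

# Step 1: keep the genes that have more than one condition.
my @multi = grep { keys %{ $experiment{$_} } > 1 } keys %experiment;

# Step 2: per gene, build [ gene, [cond, xloc, percentage, q_val], ... ]
# with the rows sorted on q_val (index 3 once the condition is prepended).
my @per_gene = map {
    my $gene  = $_;
    my $conds = $experiment{$gene};
    my @rows  = sort { $a->[3] <=> $b->[3] }
                map  { [ $_, @{ $conds->{$_} } ] }
                keys %$conds;
    [ $gene, @rows ];
} @multi;

# Step 3: order the genes by the q_val of their first (lowest) row.
my @ordered = sort { $a->[1][3] <=> $b->[1][3] } @per_gene;

# Step 4: flatten into tab-separated lines in the order
# condition, gene, xloc, q_val, percentage.
my @lines;
for my $entry (@ordered) {
    my ($gene, @rows) = @$entry;
    push @lines, join "\t", ( $gene, @$_ )[ 1, 0, 2, 4, 3 ] for @rows;
}
print "$_\n" for @lines;
```

The slice [ 1, 0, 2, 4, 3 ] only reorders the fields: the list starts as gene, condition, xloc, percentage, q_val.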