This is part of a tab-delimited file I have generated:
Sample Gene RPKM
MK_001_27 HPSE2 17.3767266340978
MK_003_11 HPSE2 51.1152373602497
MK_002_5 HPSE2 16.5372913024845
MK_001_23 HPSE2 25.8985857481585
MK_001_23 HPSE2 21.6045197173757
MK_001_27 HPSE2 139.450963428357
MK_001_23 HPSE2 36.7603351866179
MK_003_9 HPSE2 25.2860867858956
MK_001_22 HPSE2 100.915250867745
MK_003_9 HPSE2 35.078964327254
MK_003_12 HPSE2 34.078813048573
MK_003_9 HPSE2 13.5865540939141
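Every gene is present in at least one sample, and it may appear multiple times in the same sample.
I want to compute the mean of the RPKM (reads per kilobase per million) values and use it to replace the multiple values of any given gene in the same sample.
For example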
MK_003_9 HPSE2 35.078964327254
MK_003_9 HPSE2 13.5865540939141
MK_003_9 HPSE2 25.2860867858956
becomes
MK_003_9 HPSE2 24.650535069
Is there a way to do this in Unix or Perl?
Pseudocode
-For each line, combine column 0 (sample) and column 1 (gene) to create a "key"
though not always unique
-Run through the file to see if this "key" is present somewhere else
-Count the number of times this "key" is present
-If present > 1 time, calculate the mean of the RPKM values by sum()/count
-Create occurrence with this "key" and the new RPKM value
-Delete(?) the other corresponding "keys"
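Answer 0 (score: 3)
This is very simple if you accumulate the contents of the data file into a Perl hash. You aren't clear whether a single file may contain multiple genes, so I have coded this for multiple samples and multiple genes.
The program expects the input file as a parameter on the command line and prints its output to STDOUT, which can be redirected as normal.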
use strict;
use warnings 'all';
use List::Util 'sum';
print scalar <>; # Copy the header line
my %mean_rpkm;
while ( <> ) {
    my ($sample, $gene, $rpkm) = split;
    push @{ $mean_rpkm{$gene}{$sample} }, $rpkm;
}
for my $gene ( sort keys %mean_rpkm ) {
    for my $sample ( sort keys %{ $mean_rpkm{$gene} } ) {
        my $rpkm = $mean_rpkm{$gene}{$sample};
        my $mean = sum(@$rpkm) / @$rpkm;
        printf "%s\t%s\t%.3f\n", $sample, $gene, $mean;
    }
}
Sample Gene RPKM
MK_001_22 HPSE2 100.915
MK_001_23 HPSE2 28.088
MK_001_27 HPSE2 78.414
MK_002_5 HPSE2 16.537
MK_003_11 HPSE2 51.115
MK_003_12 HPSE2 34.079
MK_003_9 HPSE2 24.651
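The output of my solution is essentially unordered. I have sorted the genes and samples lexically, but you may want them in the same order as they appear in the input file. If so, you should say so.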
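A useful intermediate solution is to install Sort::Naturally (it isn't a core module) and add
use Sort::Naturally 'nsort';
to the top of the program above. If you then replace both occurrences of sort with nsort, you will get the output below. It may not be ideal, but it is an improvement because it sorts MK_003_9 before MK_003_11, which a simple lexical sort cannot do.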
Sample Gene RPKM
MK_001_22 HPSE2 100.915
MK_001_23 HPSE2 28.088
MK_001_27 HPSE2 78.414
MK_002_5 HPSE2 16.537
MK_003_9 HPSE2 24.651
MK_003_11 HPSE2 51.115
MK_003_12 HPSE2 34.079
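If installing a non-core module is not an option, a custom comparison can give a similar ordering using core Perl only. The following is a minimal sketch, not part of the original answer. It assumes every sample name has the form PREFIX_NUMBER_NUMBER (as in MK_003_9), and the sub name by_sample_name is hypothetical.

```perl
use strict;
use warnings;

# Assumption: names look like MK_003_9, i.e. a text prefix followed by
# two numeric fields, all separated by underscores.
sub by_sample_name {
    my @x = split /_/, $a;
    my @y = split /_/, $b;
    # Compare the prefix as text, then each numeric field as a number.
    return $x[0] cmp $y[0] || $x[1] <=> $y[1] || $x[2] <=> $y[2];
}

my @samples = qw( MK_003_11 MK_003_9 MK_001_27 MK_002_5 );
print "$_\n" for sort by_sample_name @samples;
```

With this comparison MK_003_9 is printed before MK_003_11. Names that do not match the assumed pattern would trigger "isn't numeric" warnings, so Sort::Naturally remains the more general choice.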
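Answer 1 (score: 1)
Use List::Util to get the sum function. Store the RPKM values in a hash of arrays, keyed by sample. At the end, sum each array and divide by its number of elements. Note that this keys on the sample alone and does not print the gene column, so it assumes the file contains a single gene: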
perl -MList::Util=sum -lane '
next if 1 == $.;
push @{ $h{ $F[0] } }, $F[2];
}{
print $_, "\t", sum(@{ $h{$_} }) / @{ $h{$_} }
for sort keys %h;
' < input-file > output-file
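-n reads the input line by line. -a splits each line on whitespace into the @F array. -l strips the trailing newline from each input line and adds one to each print. $. is the line number, which equals 1 for the header line.
Answer 2 (score: 1)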
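The core problem here is that if you have a condition like 'if there is more than one, remove the duplicates', you can't know whether that condition applies until you have parsed the whole file.
So you can read it all in, do some computation (idempotently), and then dump some output. Something like this: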
#!/usr/bin/perl
use strict;
use warnings;
my %stuff;
#iterate line by line of the special 'DATA' filehandle.
#(You probably want <> instead)
while (<DATA>) {
    #split on any whitespace.
    my ( $sample, $gene, $RPKM ) = split;
    #stuff the values into a list.
    push( @{ $stuff{$sample}{$gene} }, $RPKM );
}
#iterate processed results and print them
foreach my $sample ( sort keys %stuff ) {
    foreach my $gene ( sort keys %{ $stuff{$sample} } ) {
        #sum and divide for average.
        my $sum = 0;
        $sum += $_ for @{ $stuff{$sample}{$gene} };
        #parenthesise join so "\n" is not joined in after a trailing tab.
        print join( "\t", $sample, $gene, $sum / @{ $stuff{$sample}{$gene} } ), "\n";
    }
}
__DATA__
MK_001_27 HPSE2 17.3767266340978
MK_003_11 HPSE2 51.1152373602497
MK_002_5 HPSE2 16.5372913024845
MK_001_23 HPSE2 25.8985857481585
MK_001_23 HPSE2 21.6045197173757
MK_001_27 HPSE2 139.450963428357
MK_001_23 HPSE2 36.7603351866179
MK_003_9 HPSE2 25.2860867858956
MK_001_22 HPSE2 100.915250867745
MK_003_9 HPSE2 35.078964327254
MK_003_12 HPSE2 34.078813048573
MK_003_9 HPSE2 13.5865540939141
This gives:
MK_001_22 HPSE2 100.915250867745
MK_001_23 HPSE2 28.0878135507174
MK_001_27 HPSE2 78.4138450312274
MK_002_5 HPSE2 16.5372913024845
MK_003_11 HPSE2 51.1152373602497
MK_003_12 HPSE2 34.078813048573
MK_003_9 HPSE2 24.6505350690212
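Note: the output is sorted alphanumerically, because hashes are inherently unordered. If you need to preserve the input order, it's more complicated but not impossible.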
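One way to keep the input order is to record each sample/gene pair in an array the first time it is seen and iterate over that array instead of the sorted hash keys. The following is a minimal sketch under that assumption, not the answerer's code. The sub name mean_rpkm_in_input_order and the inline demo data are illustrative only.

```perl
use strict;
use warnings;
use List::Util 'sum';

# Takes data lines (no header) and returns one [sample, gene, mean] row
# per distinct sample/gene pair, in order of first appearance.
sub mean_rpkm_in_input_order {
    my @lines = @_;
    my ( %values, @order );
    for my $line (@lines) {
        my ( $sample, $gene, $rpkm ) = split ' ', $line;
        next unless defined $rpkm;    # skip blank or short lines
        my $key = join "\t", $sample, $gene;
        # Remember the key only the first time it is seen.
        push @order, $key unless exists $values{$key};
        push @{ $values{$key} }, $rpkm;
    }
    my @rows;
    for my $key (@order) {
        my $list = $values{$key};
        push @rows, [ split( /\t/, $key ), sum(@$list) / @$list ];
    }
    return @rows;
}

# Demo with three lines taken from the question's data.
my @demo = (
    "MK_001_27\tHPSE2\t17.3767266340978",
    "MK_003_11\tHPSE2\t51.1152373602497",
    "MK_001_27\tHPSE2\t139.450963428357",
);
printf "%s\t%s\t%.3f\n", @$_ for mean_rpkm_in_input_order(@demo);
```

To run it on a real file, one option is to copy the header with print scalar <>; and then pass the remaining lines as mean_rpkm_in_input_order(<>). The extra array costs a little memory but avoids depending on hash ordering.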