我正在使用perl从数据文件中处理大型数组,然后将它们放入R中以制作箱图。这适用于小型数据集。但是,我想输入大量数据集给同事们。 我想要一种方法来减少数据点的数量,并保持箱线图看起来相同(相同的最大值,最小值,四分位数等),因为perl快速占用内存(每个数据集大约5GB的RAM),R也是如此我确信目前有120GB的问题有10MB的解决方案。
有没有办法减少数据大小,所以在我把它放入R之前它在boxplot 中看起来是一样的?我觉得很愚蠢,因为到目前为止我所做的只适用于如下的小数据集:
#!/usr/bin/env perl
use strict; use warnings; use Cwd 'getcwd';
my $TOP_DIRECTORY = getcwd();
local $SIG{__WARN__} = sub {#kill the program if there are any warnings
my $message = shift;
my $fail_filename = "$TOP_DIRECTORY/$0.FAIL";
open my $fh, '>', $fail_filename or die "Can't write $fail_filename: $!";
printf $fh ("$message @ %s\n", getcwd());
close $fh;
die "$message\n";
};#http://perlmaven.com/how-to-capture-and-save-warnings-in-perl
sub reduce_size {#returns min, lower quartile, median, upper quartile, and max if > 5 elements
# R is not able to handle large arrays as Perl is. This subroutine reduces the work on R, which can't handle the size
#this does NOT work for means
my $x = shift;
# if (scalar @{ $x } < 6) {
# return @{ $x };
# }
my @sorted = sort {$a <=> $b} @{ $x };
my $n = scalar @{ $x };
my @return = $sorted[0];#minimum
if (($n % 2) == 0) {
$return[1] = $sorted[$n/4 ] ;
$return[2] = ($sorted[$n/2 - 1] + $sorted[$n/2])/2;
$return[3] = $sorted[3*$n/4 ] ;
} else {
my $i = sprintf("%.0f", $n/4);
$return[1] = ($sorted[ $i - 1] + $sorted[$i])/2;
$return[2] = $sorted[$n/2];
$i = sprintf("%.0f", 3*$n/4);
$return[3] = ($sorted[$i-1] + $sorted[$i])/2;
}
push @return, $sorted[-1];#maximum
return @return;
}
my @x = qw(-0.82312297 -0.08696213 2.23698132 0.20834949 0.73162884 0.22891093 1.04418464 -0.67952858 0.08111757);
my @y = reduce_size(\@x);
print join (',', @y) . "\n";
答案 0 :(得分:0)
这里的关键是翻译R的fivenum
功能:
#!/usr/bin/env perl
use strict; use warnings; use Cwd 'getcwd'; use feature 'say';
my $TOP_DIRECTORY = getcwd();
local $SIG{__WARN__} = sub {#kill the program if there are any warnings
my $message = shift;
my $fail_filename = "$TOP_DIRECTORY/$0.FAIL";
open my $fh, '>', $fail_filename or die "Can't write $fail_filename: $!";
printf $fh ("$message @ %s\n", getcwd());
close $fh;
die "$message\n";
};#http://perlmaven.com/how-to-capture-and-save-warnings-in-perl
use POSIX qw(ceil floor);
sub fivenum {
my $array = shift;
my @x = sort {$a <=> $b} @{ $array };
printf("There are %u elements.\n", scalar @{ $array });
my $n = scalar @{ $array };
if ($n == 0) {
print "no values were entered into fivenum.\n";
die;
}
my $n4 = floor(($n+3)/2)/2;
my @d = (1, $n4, ($n +1)/2, $n+1-$n4, $n);#d <- c(1, n4, (n + 1)/2, n + 1 - n4, n)
my (@floor_d, @ceiling_d);
foreach my $d (0..4) {
$floor_d[$d] = floor($d[$d]);
$ceiling_d[$d] = ceil($d[$d]);
}
my @sum_array;
foreach my $e (0..4) {
if (not defined $floor_d[$e]) {
say "\$floor_d[$e] isn't defined.";
die;
}
if (not defined $ceiling_d[$e]) {
say "\$ceiling_d[$e] isn't defined.";
die;
}
if (!defined $x[$floor_d[$e]-1]) {
say "\$x[$floor_d[$e-1]-1] isn't defined.";
die;
}
if (!defined $x[$ceiling_d[$e]-1]) {
say "\$x[$ceiling_d[$e]-1] isn't defined.";
die;
}
printf("sum_array has %u elements.\n", scalar @sum_array);
push @sum_array, (0.5 * ($x[$floor_d[$e]-1] + $x[$ceiling_d[$e]-1]));
}
return @sum_array;
}
my @x = qw(0.14082834 0.09748790 1.73131507 0.87636009 -1.95059594 0.73438555
-0.03035726 1.46675970 -0.74621349 -0.72588772 0.63905160 0.61501527
-0.98983780 -1.00447874 -0.62759469 0.66206163 1.04312009 -0.10305385
0.75775634 0.32566578);
my @y = fivenum(\@x);
say join (',', @y);