减小数据大小以制作箱线图

时间:2018-02-15 21:07:59

标签: r perl statistics

我正在使用perl从数据文件中处理大型数组,然后将它们放入R中以制作箱图。这适用于小型数据集。但是,我想输入大量数据集给同事们。 我想要一种方法来减少数据点的数量,并保持箱线图看起来相同(相同的最大值,最小值,四分位数等),因为perl快速占用内存(每个数据集大约5GB的RAM),R也是如此我确信目前有120GB的问题有10MB的解决方案。

有没有办法减少数据大小,所以在我把它放入R之前它在boxplot 中看起来是一样的?我觉得很愚蠢,因为到目前为止我所做的只适用于如下的小数据集:

#!/usr/bin/env perl

use strict; use warnings; use Cwd 'getcwd';
my $TOP_DIRECTORY = getcwd();
local $SIG{__WARN__} = sub {#kill the program if there are any warnings
    my $message = shift;
    my $fail_filename = "$TOP_DIRECTORY/$0.FAIL";
    open my $fh, '>', $fail_filename or die "Can't write $fail_filename: $!";
    printf $fh ("$message @ %s\n", getcwd());
    close $fh;
    die "$message\n";
};#http://perlmaven.com/how-to-capture-and-save-warnings-in-perl

sub reduce_size {#returns min, lower quartile, median, upper quartile, and max if > 5 elements
# R is not able to handle large arrays as Perl is.  This subroutine reduces the work on R, which can't handle the size
#this does NOT work for means
    my $x = shift;
#   if (scalar @{ $x } < 6) {
#       return @{ $x };
#   }
    my @sorted = sort {$a <=> $b} @{ $x };
    my $n = scalar @{ $x };
    my @return = $sorted[0];#minimum
    if (($n % 2) == 0) {
        $return[1] = $sorted[$n/4 ] ;
        $return[2] = ($sorted[$n/2 - 1] + $sorted[$n/2])/2;
        $return[3] = $sorted[3*$n/4 ] ;
    } else {
        my $i = sprintf("%.0f", $n/4);
        $return[1] = ($sorted[ $i - 1] + $sorted[$i])/2;
        $return[2] = $sorted[$n/2];
        $i = sprintf("%.0f", 3*$n/4);
        $return[3] = ($sorted[$i-1] + $sorted[$i])/2;
    }
    push @return, $sorted[-1];#maximum
    return @return;
}

my @x = qw(-0.82312297 -0.08696213  2.23698132  0.20834949  0.73162884  0.22891093 1.04418464 -0.67952858  0.08111757);

my @y = reduce_size(\@x);

print join (',', @y) . "\n";

1 个答案:

答案 0 :(得分:0)

这里的关键是翻译R的fivenum功能:

#!/usr/bin/env perl

use strict; use warnings; use Cwd 'getcwd'; use feature 'say';
my $TOP_DIRECTORY = getcwd();
local $SIG{__WARN__} = sub {#kill the program if there are any warnings
    my $message = shift;
    my $fail_filename = "$TOP_DIRECTORY/$0.FAIL";
    open my $fh, '>', $fail_filename or die "Can't write $fail_filename: $!";
    printf $fh ("$message @ %s\n", getcwd());
    close $fh;
    die "$message\n";
};#http://perlmaven.com/how-to-capture-and-save-warnings-in-perl

use POSIX qw(ceil floor);

sub fivenum {
    my $array = shift;
    my @x = sort {$a <=> $b} @{ $array };
    printf("There are %u elements.\n", scalar @{ $array });
    my $n = scalar @{ $array };
    if ($n == 0) {
        print "no values were entered into fivenum.\n";
        die;
    }
    my $n4 = floor(($n+3)/2)/2;
    my @d = (1, $n4, ($n +1)/2, $n+1-$n4, $n);#d <- c(1, n4, (n + 1)/2, n + 1 - n4, n)
    my (@floor_d, @ceiling_d);
    foreach my $d (0..4) {
          $floor_d[$d] = floor($d[$d]);
        $ceiling_d[$d] =  ceil($d[$d]);
    }
    my @sum_array;
    foreach my $e (0..4) {
        if (not defined $floor_d[$e]) {
            say "\$floor_d[$e] isn't defined.";
            die;
        }
        if (not defined $ceiling_d[$e]) {
            say "\$ceiling_d[$e] isn't defined.";
            die;
        }
        if (!defined $x[$floor_d[$e]-1]) {
            say "\$x[$floor_d[$e-1]-1] isn't defined.";
            die;
        }
        if (!defined $x[$ceiling_d[$e]-1]) {
            say "\$x[$ceiling_d[$e]-1] isn't defined.";
            die;
        }
        printf("sum_array has %u elements.\n", scalar @sum_array);
        push @sum_array, (0.5 * ($x[$floor_d[$e]-1] + $x[$ceiling_d[$e]-1]));
    }
    return @sum_array;
}

my @x = qw(0.14082834  0.09748790  1.73131507  0.87636009 -1.95059594  0.73438555
-0.03035726  1.46675970 -0.74621349 -0.72588772  0.63905160  0.61501527
 -0.98983780 -1.00447874 -0.62759469  0.66206163  1.04312009 -0.10305385
  0.75775634  0.32566578);

my @y = fivenum(\@x);

say join (',', @y);