从值表中对缺失数据进行线性插值

时间:2013-01-11 05:01:49

标签: arrays perl loops interpolation subroutine

以下是数据文件的摘录,我的出发点:

Marker      Distance_1  Distance_2  ID
.
.
.
30          13387412    34.80391242 seq-SN_FIRST
31          13387444    34.80391444 seq-SN_Second
31.1             
31.2             
32          13387555    39.80391    seq-SN_Third
.
.
.

这是一个制表符分隔的文件,每行包含四行元素。第一行是标题。在那之后,有大量的数据行。垂直点实际上并不在真实文件中,但它们只是表示类似于显示的实际行的数据出现在明确显示的示例行之前和之后。

某些数据行是“完整的”,即所有四个单元格条目都包含一些内容。其他行是“空白”,只有第一个实际条目,但后跟3个制表符分隔的单个空格。空白行中的那些空格需要“填写”。填充将通过线性插值完成,使用前一行和后一行的相应单元条目。例如,第2列中的缺失Distance_1 values将使用前一行的值13387444和后一行的值13387555进行插值。对于第3列值也是如此。这里只忽略第4列值。

该脚本的第一个目标是识别需要填充的数据块及其侧面的“完整”行。空行将包含3个标签式单个空格,并且将以这种方式进行ID。找到后,连续的空白行和侧面实线将被发送到子程序进行插值。

#!/usr/bin/perl
use strict;
use warnings;

die "usage: [ map positions file post SAS ]\n\n" unless @ARGV == 1;

my @file = ();

while (my $line = <$ARGV[0]>) {
  chomp $line;
  push(@file, $line);
}

my @consecutive_lines = (); # array collects a current set of consecutive lines requiring linear interpolation
my @filled = ();    # my final array, hopefully correctly filled in

#####
# search for consecutive set of lines in @file
#####

for (my $i = 0; $i < $#file; $i++) {          #  $#file returns the index of the last element in @file

  if ($file[$i] !~ /(?:\t\s){3}/) {           # if not a blank line
                                              # but a "full line"
    push(@filled, $file[$i]);                 # push the header and full lines, until...
  }

  elsif ($file[$i] =~ /(?:\t\s){3}/) {        # ...a blank line is found

    push (@consecutive_lines, $file[$i - 1]); # push preceding full line into array

    while ($file[$i] =~ /(?:\t\s){3}/ and $i < $#file) {  # keep pushing lines, so long as they are blank
                                                          # or end of file
      push(@consecutive_lines, $file[$i++]);
    }

    push(@consecutive_lines, $file[$i]) ;     # else we reach next full line, so push it into array

    my @fillme = linearInterpolation(@consecutive_lines); # send set of lines out for filling

    push(@filled, @fillme);                   # push filled in set of lines into the final array

    @consecutive_lines = ();                  # reset or undef array @consecutive_lines for next cycle

  }    # end of elsif

}    # end of for loop

感谢用户@Kenosis对上述内容的大量帮助,我已修改过(希望不会被修改)。

接下来是线性插值。我正在尝试将脚本的第一阶段链接到第二阶段。到目前为止,它运作不佳。

我的目标是将数组@incoming交给子程序。然后将该数组拆分,以便实际的单元条目是“可见的”并且可以由数组索引,并且因此被调用。我一直试图弄清楚如何为Distance_1的第2列值首先执行此操作。我觉得这个脚本很接近,并且在计算插值后立即开始偏离。

#####
# subroutine linear interpolation
#####

sub linearInterpolation {
  my @incoming = @_;    # array of consecutive set of lines

  my @splitup;                  # declare new array, will be a "split up" version of @incoming
  my ($A, $B, $C, $D, $E);      # variables for linear interpolation
  my @fillme;                   # declaring the "emtpy" array to be filled in
  my @dist_1_fills;             # array of interpolated values for dist_1

  for (my $i = 0;
    $i < scalar @incoming; $i++)     # loop to split up lines of @incoming
  {                                  # into indexed cell entries
    chomp $incoming[$i];             # and make new array of them
    my @entries = split('\t', $incoming[$i]);
    push(@splitup, @entries);
  }

  $A = $splitup[1];                   # cell entry in column 2 of preceding full line
  $B = $splitup[-3];                  # cell entry in column 2 of succeeding full line

  $C = $splitup[2];                   # cell entry in column 3 of preceding full line
  $D = $splitup[-2];                  # cell entry in column 3 of succeeding full line
  $E = scalar @incoming - 1;          # equals number of lines in the set minus 1

  for (my $i = 1; $i < $E; $i++) {    # need to start finding appropriate
                                      # number interpolated values, given number of
    my @dist_1_fills =
        interpvalues($A, $B, $E, $i); # of lines in consecutive set of lines

    for ($i = 0; $i < scalar @splitup; $i += 4) {
      push(@fillme, $splitup[$i], $dist_1_fills[$i], "dist_2_fills", "--");
                                      # fourth column values will be ignored or filled with --.
                                      # "dist_2_fills" just occupying it's proper spot until I can figure out distance 1 fills
    }
  }
}

#########

sub interpvalues {                  # subroutine to find interpolated values
  my ($A, $B, $E, $i) = @_;
  my $dist_1_answers = (($B - $A) / ($E)) * $i + $A;
  return $dist_1_answers;
}

代码在第二部分中感到困惑,第二部分涉及查找内插值并将它们发送回代码的第一部分以最终填充数据集。我认为特别是我最大的(尽管可能不是我唯一的)问题是在第二个子程序中计算出它们后,尝试用适当的值填充空白行。

非常感谢任何提示和线索!

2 个答案:

答案 0 :(得分:1)

该程序将满足您的需求。它期望输入文件名作为命令行上的参数。

use strict;
use warnings;

my @saved;
my @needed;

while (<>) {
  chomp;
  my @fields = split /\t/;

  # Pass hrough headers and junk
  unless ($fields[0] and $fields[0] =~ /\d/) {
    print "$_\n";
    next;
  }

  # Save x-value for records without a y-value
  if ($fields[1] !~ /\d/) {
    push @needed, $fields[0];
    next;
  }

  # We have a filled-out row. Calculate any intermediate missing ones
  if (@needed) {
    if ($saved[0] == $fields[0]) {
      die sprintf qq(Duplicate marker values %.1f at line %d of "%s"\n), $saved[0], $., $ARGV;
    }
    my ($a1, $b1) = solve_linear(@saved[0,1], @fields[0,1]);
    my ($a2, $b2) = solve_linear(@saved[0,2], @fields[0,2]);
    while (@needed) {
      my $x = shift @needed;
      my $y1 = $a1 * $x + $b1;
      my $y2 = $a2 * $x + $b2;
      print join("\t", $x, $y1, $y2), "\n";
    }
  }

  print "$_\n";
  @saved = @fields;
}

sub solve_linear {
  my ($x0, $y0, $x1, $y1) = @_;
  my ($dx, $dy) = ($x1 - $x0, $y1 - $y0);
  my $aa = $dy / $dx;
  my $bb = ($y0 * $dx - $x0 * $dy)  / $dx;
  return ($aa, $bb);
}

<强>输出

Marker  Distance_1  Distance_2  ID
.
.
.
30  13387412  34.80391242 seq-SN_FIRST
31  13387444  34.80391444 seq-SN_Second
31.1  13387455.1  35.303913996  --
31.2  13387466.2  35.803913552  --
32  13387555  39.80391  seq-SN_Third
.
.
.
Tool completed successfully

答案 1 :(得分:0)

我将代码修改为此,因此线性插值不是基于第一列中的值,而是基于第二列和第三列中的值。特别感谢用户@Kenosis和@Borodin。我接受了Kenosis对前一个问题的回答,我已经接受了Borodin,尽管我在“回答你自己的问题”一节中发布了这个修订版。这里发布的修订是否可以接受?我浏览了常见问题解答,但还没有发现任何相关内容。

#!/usr/bin/perl
use strict; use warnings;

my @saved;
my @needed;

while (<>) {
  chomp;
  my @fields = split /\t/;

    # Does the current line actually exist AND does it contain one or more digits.
    unless ($fields[0] and $fields[0] =~ /\d/) {
    # If no, this is the header, so print it. If yes, advance.
    print "$_\n";
    #after printing header, go back to <> and read in next line.
    next;
  }

  # Is the second cell of the current line devoid of digits?
  if ($fields[1] !~ /\d/) {                
  # If no, advance. If yes, remember $field[0], the Marker.
  push @needed, $fields[0];              
  # After pushing, go back to <> and read in next line.
  next;
 }

  # If we are here, we must have a filled-out row.
  # Does @needed have any values? If no, advance. If yes,
  if (@needed) {
    if ($saved[0] == $fields[0]) {
      die sprintf qq(Duplicate marker values %.1f at line %d of "%s"\n), $saved[0], $., $ARGV;
     }
    # Else send preceding dist_1 value, succeeding dist_1 value,
    # preceding dist_2 value, succeeding dist_2 value, 
    # and number of emtpy lines to subroutine.
    my ($dist_1_interval, $dist_2_interval) = interval_sizes($saved[1], $fields[1], $saved[2],   $fields[2], scalar @needed);     
    # Current size of @needed is saved as $size and is used to help with iteration.
# So long as @needed contains values...
my $size = scalar @needed;
while (@needed) {
  # ...remove left-most Marker value from array @needed.
  my $x = shift @needed;
  # Interpolated values for dist_1 and dist_2 are
  # (respective interval size x iteration of while loop) + preceding values.
  my $new_dist_1 = ($dist_1_interval * (1 + ($size - (scalar @needed + 1)))) + $saved[1];
  my $new_dist_2 = ($dist_2_interval * (1 + ($size - (scalar @needed + 1)))) + $saved[2];
  print join("\t", $x, $new_dist_1, $new_dist_2, "--"), "\n";
  }
}     
      # We are here since current line is already a filled-in row.
      print "$_\n";
      # Print this row and assign it to @saved. Return to <>.
      @saved = @fields;
} 

sub interval_sizes {
  # $A = preceding dist_1, $B = succeeding dist_1, 
  # $C = preceding dist_2, $D = succeeding dist_2,
  # $E = number of needed distances.
  my ($A, $B, $C, $D, $E) = @_; 
  # I need an interval size for dist_1 based on difference between $B and $A.
  my $dist_1_interval = ($B - $A)/($E + 1);
  # I need an interval size for dist_2 based on difference between $D and $C.
  my $dist_2_interval = ($D - $C)/($E + 1);
  return ($dist_1_interval, $dist_2_interval);
 }