awk: print missing sequence gaps and the minimum and maximum values

Date: 2014-05-15 14:30:31

Tags: perl bash unix awk

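I want to print the gaps in the sequence numbers of the first column (start of missing sequence, end of missing sequence). For each gap I also need the minimum and maximum sequence number of the first column within each combination of the fields $2, substr($3,4,6), substr($4,4,6), $6, $8, $10. The input file is not sorted by the first column.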

Input.csv

21,abc,22-JUN-12.08:06:03,22-JUN-12.08:06:03,19-Apr-16,1,INR,RO0412,RC03,L7,,31
22,abc,22-JUN-12.08:06:03,22-JUN-12.08:06:03,19-Apr-16,1,INR,RO0412,RC03,L7,,31
23,abc,22-JUN-12.08:06:03,22-JUN-12.08:06:03,19-Apr-16,1,INR,RO0412,RC03,L7,,31
24,abc,30-JUN-12.01:06:49,30-JUN-12.01:06:49,19-Apr-16,1,INR,RO0412,RC03,L7,,29
28,abc,30-JUN-12.01:06:49,30-JUN-12.01:06:49,19-Apr-16,1,INR,RO0412,RC03,L7,,29
32,abc,29-MAY-13.12:05:11,29-MAY-13.12:05:11,15-Feb-17,1350,INR,RO0213,CD,K1,,30
38,abc,29-MAY-13.12:05:11,29-MAY-13.12:05:11,15-Feb-17,1350,INR,RO0213,CD,K1,,30
41,abc,20-FEB-14.11:02:37,20-FEB-14.11:02:37,31-Dec-20,650,INR,EN1113,ch650,S317,,28
46,abc,20-FEB-14.11:02:37,20-FEB-14.11:02:37,31-Dec-20,650,INR,EN1113,ch650,S317,,28
51,abc,20-FEB-14.11:02:37,20-FEB-14.11:02:37,31-Dec-20,650,INR,EN1113,ch650,S317,,28
52,abc,20-FEB-14.11:02:37,20-FEB-14.11:02:37,31-Dec-20,650,INR,EN1113,ch650,S317,,28

I tried this command and got only part of the output I need:

cat Input.csv | \
awk -F, '{OFS=","; print $1,$2,substr($3,4,6),substr($4,4,6),$6,$8,$10}' | \
sort -k1 -t, | \
awk -F, 'BEGIN {OFS=","} (($1!=p+1) && ($7==p7)) {print p,p2,p3,p4,p5,p6,p7,p+1 "," $1-1,$1} {p=$1;p2=$2;p3=$3;p4=$4;p5=$5;p6=$6;p7=$7}'

Output of the command above, with the column names as a header:

Minimum Seq ($1),$2,substr($3,4,6),substr($4,4,6),$6,$8,$10,start Missing Seq ($1),End Missing Seq ($1),Maximum Seq ($1)

24,abc,JUN-12,JUN-12,1,RO0412,L7,25,27,28
32,abc,MAY-13,MAY-13,1350,RO0213,K1,33,37,38
41,abc,FEB-14,FEB-14,650,EN1113,S317,42,45,46
46,abc,FEB-14,FEB-14,650,EN1113,S317,47,50,51

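In the output above, the Minimum Seq ($1) and Maximum Seq ($1) values are not what I expect. Please help. For example, in the first output line the minimum seq should be 21, not 24, and in the third output line the maximum seq should be 52, not 46.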
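A side note on the attempted pipeline: `sort -k1 -t,` compares the key as text and lets it run from field 1 to the end of the line. That happens to work for this sample because every sequence number has two digits, but it would misorder values of different widths. A minimal sketch of a numeric sort restricted to the first field, using made-up values and a hypothetical helper name `sort_by_seq` (neither is from the original post):

```shell
# Hypothetical helper: sort comma-separated lines from stdin
# numerically on the first field only.
sort_by_seq() {
  sort -t, -k1,1n
}

# Made-up sequence numbers of different widths (not from Input.csv).
printf '100,c\n9,a\n10,b\n' | sort_by_seq
# prints:
# 9,a
# 10,b
# 100,c
```

Here `-k1,1` limits the key to the first field and the `n` modifier makes the comparison numeric.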

Desired output:

## $2,$3,$4,$6,$8,$10,Start Missing Seq ($1),End Missing Seq ($1),Minimum Seq ($1),Maximum Seq ($1) ##

abc,JUN-12,JUN-12,1,ROTN0412,L7,25,27,21,28
abc,MAY-13,MAY-13,1350,ROTN0213,K1,33,37,32,38
abc,FEB-14,FEB-14,650,CHEN1113,S317,42,45,41,52
abc,FEB-14,FEB-14,650,CHEN1113,S317,47,50,41,52

1 answer:

Answer 0 (score: 0)

You can try the following Perl script:

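#!/usr/bin/perl

use warnings;
use strict;
use File::Slurp qw(read_file);   # CPAN module, not part of core Perl
use List::Util qw(min max);

# Read all lines of the input file (named Input.csv in the question).
my @lines=read_file('Input.csv');

# Sort the lines numerically by the first field.
my $ll=sortLines(\@lines);

# Keep only the seven fields of interest.
$ll=reduceFields($ll);

# Minimum and maximum sequence number per group.
my $rr=findRanges($ll);

printMissingSeqs($rr,$ll);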


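# Print one line per gap: the group's fields, the first and last missing
# sequence number, then the group's minimum and maximum.
# A gap is reported only when two consecutive sorted lines share the same key.
sub printMissingSeqs {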
  my ($rr,$ll) = @_;

  my $pkey=""; my $pss; my $i=0; 
  for (@$ll) {
     my @f=split(/,/);
     my $key=$f[6];
     my $ss=$f[0];
     $pss=$ss if $i==0;
     if (($key eq $pkey) && ($ss-$pss)>1) {
        print join(",",(@f[1..6], $pss+1,$ss-1,@{$rr->{$key}}))."\n";
     }
     $pkey=$key; $pss=$ss;
     $i++;
  }
}

# Map each group key (the last reduced field, i.e. $10 of the input) to [min, max].
sub findRanges {
  my ($ll) = @_;

  my %temp;
  my %rr;

  for (@$ll) {
     my @f=split(/,/);
     push (@{$temp{$f[6]}},$f[0]);
  }

  for (keys %temp) {
     my $min=min(@{$temp{$_}});
     my $max=max(@{$temp{$_}});
     $rr{$_}=[$min, $max];
  }
  return \%rr;
}

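# Reduce each line to: $1,$2,substr($3,4,6),substr($4,4,6),$6,$8,$10.
# Perl's substr offsets are 0-based, so awk's substr(x,4,6) becomes substr(x,3,6).
sub reduceFields {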
  my ($ll) = @_;

  my @a;
  for (@$ll) {
     my @f=split(/,/);
     my $line=join(",",($f[0],$f[1],substr($f[2],3,6),substr($f[3],3,6),$f[5],$f[7],$f[9]));
     push (@a,$line);
  }
  return \@a;
}


# Sort lines numerically by the first comma-separated field.
sub sortLines {
  my ($lines) = @_;

  my @a=sort { my ($keyA)=$a=~/(.*?),/; my ($keyB)=$b=~/(.*?),/; $keyA<=>$keyB} @$lines;

  return \@a;
}

Output:

abc,JUN-12,JUN-12,1,RO0412,L7,25,27,21,28
abc,MAY-13,MAY-13,1350,RO0213,K1,33,37,32,38
abc,FEB-14,FEB-14,650,EN1113,S317,42,45,41,52
abc,FEB-14,FEB-14,650,EN1113,S317,47,50,41,52
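
For anyone who would rather stay in awk, here is a rough sketch of the same idea. The attempt in the question prints `p` and `$1`, which are the sequence numbers on either side of each gap, so it can never show the group-wide minimum and maximum. The sketch below remembers the first and last sequence number seen per group and prints the gaps at the end. The function name `missing_gaps` is hypothetical. Unlike the Perl script, which keys on `$10` alone, it groups on the full field combination; that is my reading of the question rather than something the answer confirms.

```shell
# Hypothetical helper: report gaps per group for the CSV file given as $1.
missing_gaps() {
  # Numeric sort on the first field, then one awk pass.
  sort -t, -k1,1n "$1" |
  awk -F, -v OFS=, '
    {
      # Group key: the reduced field combination named in the question.
      key = $2 OFS substr($3,4,6) OFS substr($4,4,6) OFS $6 OFS $8 OFS $10
      seq = $1 + 0

      # Input is sorted ascending, so the first value seen is the minimum.
      if (!(key in min)) { min[key] = seq; order[++n] = key }

      # A gap: the previous sorted line has the same key and is not adjacent.
      if (NR > 1 && key == pkey && seq - pseq > 1)
        gaps[key] = gaps[key] (pseq + 1) OFS (seq - 1) "\n"

      max[key] = seq          # last value seen so far is the maximum
      pkey = key; pseq = seq
    }
    END {
      # Print groups in order of first appearance, one line per gap.
      for (i = 1; i <= n; i++) {
        key = order[i]
        m = split(gaps[key], g, "\n")   # last piece is empty, so stop at m-1
        for (j = 1; j < m; j++)
          print key, g[j], min[key], max[key]
      }
    }'
}
```

Calling `missing_gaps Input.csv` on the sample data should give the same four lines as the Perl output above. The gaps are buffered until `END` because a group's maximum is not known until all of its lines have been read.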