How to group repeated columns in a text file - Perl (groups, subgroups)

Date: 2017-07-27 23:36:28

Tags: perl multidimensional-array grouping multiple-columns

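Input description

This is a tab-delimited file. The first 5 columns are IDs and their relationships, and they must always appear in the final output. From column 6 onward, I want to group the columns whenever their values are repeated across rows. There are different groups, like a hierarchy: column 3 is the grouping criterion. In this example, 4 and 5 are the criteria. Please see the expected output.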
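As a starting point, the description above can be read as "split each line on tabs, keep the first 5 columns as fixed, and bucket the rows by column 3". The sketch below shows that step only. The helper name `bucket_by_criterion` and the hash keys `fixed` and `values` are hypothetical, chosen for illustration; they are not from the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: split tab-delimited lines and bucket them by column 3.
# Each stored row keeps the 5 fixed columns apart from the values to be grouped.
sub bucket_by_criterion {
    my @lines = @_;
    my %rows_by_criterion;
    for my $line (@lines) {
        chomp $line;
        my @fields = split /\t/, $line;
        next if @fields < 6;    # need 5 fixed columns plus at least one value
        my @fixed  = @fields[0 .. 4];
        my @values = @fields[5 .. $#fields];
        push @{ $rows_by_criterion{ $fields[2] } },
            { fixed => \@fixed, values => \@values };
    }
    return \%rows_by_criterion;
}

# Two rows taken from the sample input, one per criterion.
my $buckets = bucket_by_criterion(
    "AG1_4099\t13\t4\t2\t2\tUV1040\tUV0000\tUV3770\tUV3890\n",
    "AG1_3133\t24\t5\t2\t2\tUV1040\tUV300\tUV2100\tUV3770\tUV3890\n",
);
for my $criterion (sort keys %$buckets) {
    printf "criterion %s: %d row(s)\n",
        $criterion, scalar @{ $buckets->{$criterion} };
}
```

Keeping the fixed columns and the values in separate arrays means later steps can compare values without index arithmetic starting at 5.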

Input

AG1_4099    13  4   2   2   UV1040  UV0000  UV3770  UV3890
AG1_9001    20  4   2   1   UV1040  UV0000  UV3770  UV3890
AG1_9011    63  4   2   4   UV1040  UV0000  UV3770  UV3890
AG1_7013    11  4   1   1   UV1040  UV0000  UV3770  UV3890
AG1_9010    37  4   1   1   UV1040  UV0000  UV3770  UV3890
AG1_1011    33  4   2   7   UV1040  UV2080  UV3770  UV3890
AG1_1013    101 4   1   1   UV1040  UV2080  UV3770  UV3890
AG1_0001    7   4   2   1   UV1040  UV2100  UV3770  UV3890
AG1_1010    23  4   1   1   UV1040  UV8000  UV3770  UV3890
AG1_2099    13  4   2   2   UV1040  UV1000  UV3770  UV3890
AG1_3133    24  5   2   2   UV1040  UV300   UV2100  UV3770  UV3890
AG1_3433    343 5   7   3   UV1040  UV2118  UV2100  UV3890  UV3770
AG1_1100    254 5   1   4   UV2100  UV3770  UV3890  UV2105  MK7
AG1_8111    3   5   3   2   UV1040  UV3770  UV3890  UV2100  MK1
AG1_3430    84  5   2   2   UV1040  UV3770  UV3890  UV2100  MK1
AG1_7700    87  5   3   2   UV1040  UV3770  UV3890  UV2100  MK1
....
....
and so on

Expected output

(1) #### Criteria 4 grouped 
AG1_4099    13  4   2   2   UV1040  UV0000  UV3770  UV3890
AG1_9001    20  4   2   1   UV1040  UV0000  UV3770  UV3890
AG1_9011    63  4   2   4   UV1040  UV0000  UV3770  UV3890
AG1_7013    11  4   1   1   UV1040  UV0000  UV3770  UV3890
AG1_9010    37  4   1   1   UV1040  UV0000  UV3770  UV3890

AG1_1011    33  4   2   7   UV1040  UV2080  UV3770  UV3890
AG1_1013    101 4   1   1   UV1040  UV2080  UV3770  UV3890

AG1_0001    7   4   2   1   UV1040  UV3770  UV3890
AG1_1010    23  4   1   1   UV1040  UV3770  UV3890
AG1_2099    13  4   2   2   UV1040  UV3770  UV3890

#### Singles
AG1_0001    7   4   2   1   UV2100
AG1_1010    23  4   1   1   UV8000
AG1_2099    13  4   2   2   UV1000

(2) #### Criteria 5 is grouped
AG1_8111    3   5   3   2   UV1040  UV2100  UV3770  UV3890  MK1
AG1_3430    84  5   2   2   UV1040  UV2100  UV3770  UV3890  MK1
AG1_7700    87  5   3   2   UV1040  UV2100  UV3770  UV3890  MK1

AG1_3133    24  5   2   2   UV1040  UV2100  UV3770  UV3890
AG1_3433    343 5   7   3   UV1040  UV2100  UV3770  UV3890

AG1_1100    254 5   1   4   UV2100  UV3770  UV3890

#### Singles
AG1_1100    254 5   1   4   UV2105  MK7
AG1_3133    24  5   2   2   UV300
AG1_3433    343 5   7   3   UV2118
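The expected output above seems consistent with one rule, although the question does not state it explicitly: within a criterion, a value that occurs in only one row goes to "Singles", and rows whose remaining values form the same set are printed together. The following is a minimal sketch under that assumption, not a confirmed solution. The helper `group_rows` and the keys `fixed`, `values`, and `rows` are hypothetical names. The order of groups and of values within a group is also an assumption (first-seen order and sorted order), so it may differ from the listing above.

```perl
use strict;
use warnings;

# Hypothetical helper. $rows is an array ref of { fixed => [...], values => [...] }
# hashes for ONE criterion. Returns (\@groups, \@singles).
sub group_rows {
    my ($rows) = @_;

    # Count in how many rows each value occurs (at most once per row).
    my %row_count;
    for my $row (@$rows) {
        my %seen;
        $row_count{$_}++ for grep { !$seen{$_}++ } @{ $row->{values} };
    }

    my (%group_for_key, @groups, @singles);
    for my $row (@$rows) {
        my @shared = grep { $row_count{$_} > 1 }  @{ $row->{values} };
        my @single = grep { $row_count{$_} == 1 } @{ $row->{values} };

        push @singles, { fixed => $row->{fixed}, values => \@single } if @single;
        next unless @shared;

        # Sorted unique values act as an order-independent signature of the set.
        my %uniq;
        my @unique = grep { !$uniq{$_}++ } @shared;
        my @sorted = sort @unique;
        my $key    = join "\t", @sorted;

        unless ($group_for_key{$key}) {
            $group_for_key{$key} = { values => \@sorted, rows => [] };
            push @groups, $group_for_key{$key};    # remember first-seen order
        }
        push @{ $group_for_key{$key}{rows} }, $row->{fixed};
    }
    return (\@groups, \@singles);
}

# The criterion-5 rows from the sample input (whitespace-separated here for brevity).
my @rows = map {
    my @f = split ' ', $_;
    +{ fixed => [ @f[0 .. 4] ], values => [ @f[5 .. $#f] ] };
} (
    "AG1_3133 24 5 2 2 UV1040 UV300 UV2100 UV3770 UV3890",
    "AG1_3433 343 5 7 3 UV1040 UV2118 UV2100 UV3890 UV3770",
    "AG1_1100 254 5 1 4 UV2100 UV3770 UV3890 UV2105 MK7",
    "AG1_8111 3 5 3 2 UV1040 UV3770 UV3890 UV2100 MK1",
    "AG1_3430 84 5 2 2 UV1040 UV3770 UV3890 UV2100 MK1",
    "AG1_7700 87 5 3 2 UV1040 UV3770 UV3890 UV2100 MK1",
);

my ($groups, $singles) = group_rows(\@rows);
for my $group (@$groups) {
    print join("\t", @$_, @{ $group->{values} }), "\n" for @{ $group->{rows} };
    print "\n";
}
print "#### Singles\n";
print join("\t", @{ $_->{fixed} }, @{ $_->{values} }), "\n" for @$singles;
```

Using a joined, sorted list of values as a hash key avoids comparing every pair of rows in nested loops: rows with the same set of values land in the same hash entry in a single pass.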

Code

After the match in the innermost for loop, I am unable to form the groups. Please help and correct my code.

use strict;
use warnings;

#my $in = $ARGV[0]; chomp $in;
my $in = "test.txt";

# Three-argument open with a lexical filehandle
open(my $fh, '<', $in) or die "Unable to open $in: $!\n";

my @multiArr = ();

while (my $line = <$fh>)
{
    chomp $line;
    my @lineArr = split(/\t/, $line);
    push (@multiArr, \@lineArr);
}
close $fh;

my ($quant, $lineCnt) = ("", "");   # initialize both variables, not only the first
my @gnmArr = ();
my $count = 0;
my $tmpquant = "";
my @tmpMulti = ();
my $out = "";

LOOP: for (my $i = 5; $i < 6; $i++)
{


    for (my $line = 0; $line < scalar @multiArr ; $line++)
    { 
        $quant = $multiArr[$line][2];
        $lineCnt = scalar @{$multiArr[$line]}-1;

        if($i == $quant)
        {
            $count++;
            push (@tmpMulti, \@{$multiArr[$line]});
            $tmpquant = $multiArr[$line][2];
        }
    } 

    my $c = 0;   # counts matching value pairs
    for (my $cls = 0; $cls < scalar @tmpMulti ; $cls++)
    {
        print "start";  
        for (my $move = scalar @tmpMulti-1; $move > $cls; $move--)
        {
#print "$move\t$cls\n";
            for (my $gnm = 5; $gnm < scalar @{$tmpMulti[$cls]}; $gnm++)
            {
#print "$gnm\n";
                for (my $g = 5; $g < scalar @{$tmpMulti[$move]}; $g++)
                {

                    if($tmpMulti[$cls][$gnm] eq $tmpMulti[$move][$g]){
#print "$cls\t$gnm\n$move\t$gnm\n";
                        print "$tmpMulti[$cls][$gnm]\n";
                        $c++;
                    }
                }
            }
#print "$move\t$cls\t$c\n";
        }
        print "\n";
#print "$c\n";
    }


    if($i ne $tmpquant)
    {
        next LOOP;
    }

}

0 answers:

No answers