Question

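I've gotten a bit stuck and wonder if anyone can help clear this up. What I want to do is:

Open a bunch of .txt files containing data
Create an array of arrays containing @array[@filenames][@data]
Based on the data

Here I slurp the file into a variable, use a regex to extract my data, and put it into an array: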

    while (my $row = <$fh>) {
        unless ($. == 0) {
            {
            local $/; # enable slurp
            @datalist = <$fh> =~ /\s*\d*\/\s*\d*\|\s*(.*?)\|.*?(?:.*?\|){4}\s*(\S*)\|(\S*).*\|/g; #extract article numbers # $1 = article number, $2 = quantity, $3 = unit
            }
            push(@arrayofarrays,[@datalist]);
            push(@filenames,$file);
            last;
            }
        }
        $numr++;
}
open(my $feh,">","test.txt");
print {$feh} Dumper \@arrayofarrays;

Dumper shows that my data looks fine (replaced with placeholder values and shortened to make it easier to read):

$VAR1 = [
          [
            'data type1',
            'data type2',
            'data type3',
            'data type1',
            'data type2',
            'data type3',
            ...
          ],
          [
            'data type1',
            'data type2',
            'data type3',
            ...
          ],
        ...
     ];

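So I wonder if anyone knows a simple way to check for duplicates between the data sets? I know I can use

What I tried may give a better idea of what I need to do: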

my $i = 0;
my $j = 0;
while ( $i <= scalar @arrayofarrays) {
    $j = 0;
    while ( $j <= scalar @arrayofarrays) {
        if (@{$arrayofarrays[$i]} eq @{$arrayofarrays[$j]}) {
            print "\n'$filenames[$i]' is duplicate to '$filenames[$j]'.";
            } $j++;
        } $i++;
    }

Answer 1

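Instead of an array of arrays, I would create a hash of arrays, generating each key from the sub-array's data by flattening the sub-array into a string, optionally converting it to a checksum (this also works for multidimensional sub-arrays). You may want to read this discussion on PerlMonks: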

http://www.perlmonks.org/?node_id=1121378
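
The checksum variant mentioned above could be sketched roughly as follows. This is only a minimal illustration, not the code from the linked discussion: `checksum_key` is a hypothetical helper name, the sample rows are made up, and the sketch assumes flat sub-arrays of byte strings that never contain the separator `"\x1F"`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: flatten one sub-array into a string and hash it.
# "\x1F" (ASCII unit separator) is assumed not to occur in the data,
# so ['a b'] and ['a', 'b'] cannot collapse to the same key.
sub checksum_key {
    my ($row) = @_;
    return md5_hex(join("\x1F", @$row));
}

# Made-up sample rows: article number, quantity, unit.
my @rows = (
    ['A100', '5', 'pcs'],
    ['A200', '2', 'kg'],
    ['A100', '5', 'pcs'],   # same content as the first row
);

# Keep a row only the first time its checksum is seen.
my %seen;
my @unique = grep { !$seen{ checksum_key($_) }++ } @rows;

print scalar(@unique), " unique rows\n";   # prints "2 unique rows"
```

A fixed-length checksum keeps the hash keys short when the sub-arrays are large; for small data sets the joined string itself works just as well as a key.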

Here is an abstract example using an already existing array with duplicate data in its sub-arrays (you can test it on ideone.com):

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @array = (
    [1,'John','ABXC12132328'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322']
);
my %uniq_helper = ();
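# "@$_" flattens each sub-array into a space-joined string used as the hash key;
# grep keeps a sub-array only the first time its key is seen.
my @uniq_data = grep { !$uniq_helper{"@$_"}++ } @array;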
print Dumper(\%uniq_helper) . "\n";
print Dumper(\@uniq_data) . "\n";

For your case, it could look something like this:

my %datalist;
while (my $row = <$fh>) {
    unless ($. == 0) {
        my @data; # declared per file so each hash entry gets its own data
        {
            local $/; # enable slurp
            @data = <$fh> =~ /\s*\d*\/\s*\d*\|\s*(.*?)\|.*?(?:.*?\|){4}\s*(\S*)\|(\S*).*\|/g; #extract article numbers # $1 = article number, $2 = quantity, $3 = unit
        }
        $datalist{"@data"} = [@data]; # key = flattened data; identical data sets share one entry
        push(@filenames,$file);
        last;
    }
}
$numr++;
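
To report which files are duplicates of each other, as the question's nested loop attempts, the same flattened key can map to a list of filenames. The following is a rough sketch under assumptions: the sample data is invented, `%files_for` and `@duplicate_groups` are hypothetical names, and `@filenames` and `@arrayofarrays` are assumed to be filled in parallel (same index, same file) as in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for @filenames and @arrayofarrays.
my @filenames     = ('a.txt', 'b.txt', 'c.txt', 'd.txt');
my @arrayofarrays = (
    ['A100', '5', 'pcs', 'A200', '2', 'kg'],
    ['A300', '1', 'pcs'],
    ['A100', '5', 'pcs', 'A200', '2', 'kg'],   # same data as a.txt
    ['A300', '2', 'pcs'],
);

# Map each flattened data set to the list of files that contain it.
# "\x1F" is assumed not to appear in the data.
my %files_for;
for my $i (0 .. $#arrayofarrays) {
    my $key = join("\x1F", @{ $arrayofarrays[$i] });
    push @{ $files_for{$key} }, $filenames[$i];
}

# Any key that collected more than one file is a group of duplicates.
my @duplicate_groups = grep { @$_ > 1 } values %files_for;
for my $group (@duplicate_groups) {
    my ($first, @rest) = @$group;
    print "'$_' is duplicate to '$first'.\n" for @rest;
}
```

This makes a single pass over the data sets instead of comparing every pair, and it avoids comparing arrays with `eq`, which only compares their element counts.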

Answer 2

When you create @datalist, build a key for it and check that key before doing the push, like this:

my %checkHash; # declare once, outside the loop over the files
my $key=arrayKey(\@datalist);
if (!$checkHash{$key}) {
    push(@arrayofarrays,[@datalist]);
    push(@filenames,$file);
    $checkHash{$key}=1;
    last;
}

# Build a string key from an array reference, recursing into nested arrays.
sub arrayKey {
    my $arrayRef = shift;
    my $output='';
    for (@$arrayRef) {
        if (ref($_) eq 'ARRAY') {
            $output.="[";
            $output.=arrayKey($_);
            $output.="]";
        }
        else {
            $output.="$_,";
        }
    }
    return $output;
}
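
As a usage illustration, the check-before-push idea might be exercised on nested sample data like this. It is a sketch only: `nested_key` is a hypothetical, slightly more compact variant of the key builder above, and the file names and values are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variant of the key builder: recurse into nested array
# references and mark each level of nesting with brackets.
sub nested_key {
    my ($array_ref) = @_;
    return join ',', map {
        ref($_) eq 'ARRAY' ? '[' . nested_key($_) . ']' : $_
    } @$array_ref;
}

# Invented candidates: [ file name, extracted data ].
my @candidates = (
    [ 'a.txt', ['A100', ['5', 'pcs']] ],
    [ 'b.txt', ['A100', '5', 'pcs']   ],   # same values, different nesting
    [ 'c.txt', ['A100', ['5', 'pcs']] ],   # duplicate of a.txt
);

my (%check_hash, @arrayofarrays, @filenames);
for my $candidate (@candidates) {
    my ($file, $datalist) = @$candidate;
    my $key = nested_key($datalist);
    next if $check_hash{$key}++;    # key already stored: skip the duplicate
    push @arrayofarrays, $datalist;
    push @filenames, $file;
}

print "kept: @filenames\n";   # prints "kept: a.txt b.txt"
```

Because the brackets are part of the key, data sets with the same values but different nesting are treated as distinct.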

Looking for duplicate arrays inside an array (multidimensional array)
