How can I detect multiple duplicate fields in a file using Perl?

Time: 2015-01-15 23:29:40

Tags: perl duplicates field

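I have a bunch of NETFLIX orders in my brokerage account. I accidentally entered two duplicate GTC sell orders, on 1/5 and 1/6. How can I detect this with a Perl script?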

 Buy NFLX     50 @  315.00  Reg-Acct Fake
 Buy NFLX     50 @  317.50  Reg-Acct OPEN              01/13/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/05/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/06/15
Sell NFLX     50 @  362.00  Reg-Acct OPEN              11/25/14
...
Sell NFLX     50 @  345.00  IRA-Acct OPEN              09/15/14

I would like the script to spit out only these two lines, which are judged identical by fields[0] through fields[6].

Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/05/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/06/15

I would prefer a simple script (i.e. no one-liners, no hashes), since I am new to Perl.

Thanks, Larry

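2 answers:

Answer 0: (score: 1)

I know you said no one-liners, but in case you meant no Perl one-liners, here is a shell pipeline. rev reverses each line so that the date becomes the first field, uniq -f 1 skips that field when comparing, -D (a GNU uniq option) prints every duplicated line, and the final rev restores the original text: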

sort filename|rev|uniq -D -f 1|rev

Answer 1: (score: 0)

  

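I would prefer a simple script (no hashes)

Whoops, I missed the "no hashes" part. Unfortunately, "simple" and "no hashes" are opposing goals, not to mention that no hashes means inefficient, i.e. slow. See the hash-based code further down for how you should do this. In the meantime, you need parallel arrays: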

use strict;
use warnings;
use 5.016;
use Data::Dumper;

my @orders;
my @counts;

my $fname = 'data3.txt';

open my $ORDERSFILE, '<', $fname
    or die "Couldn't open $fname: $!";

LINE:
while (my $line = <$ORDERSFILE>) {
    my @pieces = split ' ', $line;
    my $date = pop @pieces;
    my $order = join ' ', @pieces;

    if (not @orders) { #then length of @orders is 0
        $orders[0] = $order;
        $counts[0] = 1;
        next LINE;
    }

    for my $i (0..$#orders) {
        if ($orders[$i] eq $order) {
            $counts[$i]++;
            next LINE;
        }
    }
    #If execution reaches here, then the order wasn't found in the array,
    #so append it (with a count of 1) to both parallel arrays.
    push @orders, $order;
    push @counts, 1;
}

say Dumper(\@orders);
say Dumper(\@counts);


for my $i (0..$#counts) {
    if ($counts[$i] > 1) {
        say "($counts[$i]) $orders[$i]";
    }
}

--output:--
$VAR1 = [
          'Buy NFLX 50 @ 315.00 Reg-Acct',
          'Buy NFLX 50 @ 317.50 Reg-Acct OPEN',
          'Sell NFLX 50 @ 345.00 Reg-Acct OPEN',
          'Sell NFLX 50 @ 362.00 Reg-Acct OPEN',
          'Sell NFLX 50 @ 345.00 IRA-Acct OPEN'
        ];

$VAR1 = [
          1,
          1,
          2,
          1,
          1
        ];

(2) Sell NFLX 50 @ 345.00 Reg-Acct OPEN
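The script above reports the duplicated order and its count, while the question asked for the original lines, dates included. A minimal sketch of that variant, still without hashes: keep every line in an array and compare each line against all the others on everything except the last field. The helper names `order_key` and `duplicate_lines` are made up for this illustration, and the sample data is copied from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: the comparison key is every field except the last one.
sub order_key {
    my ($line) = @_;
    my @fields = split ' ', $line;
    pop @fields;                      # drop the trailing date (or status) field
    return join ' ', @fields;
}

# Hypothetical helper: return the lines whose key also occurs on another line.
sub duplicate_lines {
    my @lines = @_;
    my @keys  = map { order_key($_) } @lines;
    my @dups;

    for my $i (0 .. $#lines) {
        for my $j (0 .. $#lines) {
            next if $i == $j;         # never compare a line with itself
            if ($keys[$i] eq $keys[$j]) {
                push @dups, $lines[$i];
                last;                 # one match is enough to report this line
            }
        }
    }
    return @dups;
}

my @lines = (
    ' Buy NFLX     50 @  315.00  Reg-Acct Fake',
    ' Buy NFLX     50 @  317.50  Reg-Acct OPEN              01/13/15',
    'Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/05/15',
    'Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/06/15',
    'Sell NFLX     50 @  362.00  Reg-Acct OPEN              11/25/14',
    'Sell NFLX     50 @  345.00  IRA-Acct OPEN              09/15/14',
);

# Prints the two Reg-Acct sell orders at 345.00, in file order.
print "$_\n" for duplicate_lines(@lines);
```

The nested loop compares every pair of lines, so the running time grows with the square of the number of orders. That is the price of avoiding hashes, and it is unlikely to matter for a file of brokerage orders.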

Here are some better solutions:

use strict;
use warnings;
use 5.016;
use Data::Dumper;

my %dates_for;   #A key will be an order; a value will be a reference to an array of dates.

while (my $line = <DATA>) {
    my @pieces = split ' ', $line;
    my $date = pop @pieces;
    my $order = join ' ', @pieces;

    push @{$dates_for{$order}}, $date;  #autovivification (see explanation below)
}

say Dumper(\%dates_for);

my @dates;

for my $order (keys %dates_for) {
    @dates = @{$dates_for{$order}};
    my $dup_count = @dates;

    if ($dup_count > 1) {
        say "($dup_count) $order";
        say "   $_" for @dates;
    }
}


__DATA__
 Buy NFLX     50 @  315.00  Reg-Acct Fake
 Buy NFLX     50 @  317.50  Reg-Acct OPEN              01/13/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/05/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/06/15
Sell NFLX     50 @  362.00  Reg-Acct OPEN              11/25/14
Sell NFLX     50 @  345.00  IRA-Acct OPEN              09/15/14  


--output:--
$VAR1 = {
          'Sell NFLX 50 @ 345.00 IRA-Acct OPEN' => [
                                                     '09/15/14'
                                                   ],
          'Sell NFLX 50 @ 345.00 Reg-Acct OPEN' => [
                                                     '01/05/15',
                                                     '01/06/15'
                                                   ],
          'Buy NFLX 50 @ 317.50 Reg-Acct OPEN' => [
                                                    '01/13/15'
                                                  ],
          'Buy NFLX 50 @ 315.00 Reg-Acct' => [
                                               'Fake'
                                             ],
          'Sell NFLX 50 @ 362.00 Reg-Acct OPEN' => [
                                                     '11/25/14'
                                                   ]
        };

(2) Sell NFLX 50 @ 345.00 Reg-Acct OPEN
   01/05/15
   01/06/15
  

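When an undefined variable is dereferenced, it gets silently upgraded to an array or hash reference (depending on the type of the dereferencing). This behaviour is called autovivification and usually does what you mean (e.g. when you store a value)....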

http://search.cpan.org/~vpit/autovivification-0.14/lib/autovivification.pm
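As a small illustration of the quoted behaviour, here is a sketch that pushes onto a hash slot that does not exist yet and then inspects what Perl created. The key 'Sell NFLX' and the hash `%explicit` are made up for the example.

```perl
use strict;
use warnings;

my %dates_for;    # starts out empty

# Nothing is stored under this key yet. Dereferencing the undefined value
# with @{...} makes Perl create an anonymous array and store a reference to it.
push @{ $dates_for{'Sell NFLX'} }, '01/05/15';
push @{ $dates_for{'Sell NFLX'} }, '01/06/15';

print ref($dates_for{'Sell NFLX'}), "\n";             # ARRAY
print scalar(@{ $dates_for{'Sell NFLX'} }), "\n";     # 2

# The same result without relying on autovivification:
my %explicit;
$explicit{'Sell NFLX'} = [] unless exists $explicit{'Sell NFLX'};
push @{ $explicit{'Sell NFLX'} }, '01/05/15';
```

Without autovivification, every push in the loop would need the explicit "create the array first" check shown at the end.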

For fixed-width columns, it is more efficient to use unpack():

use strict;
use warnings;
use 5.016;
use Data::Dumper;

my $fname = 'data3.txt';

open my $ORDERSFILE, '<', $fname
    or die "Couldn't open $fname: $!";

my %dates_for;

while (my $line = <$ORDERSFILE>) {
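    chomp $line;
    #Pad short lines (such as the 'Fake' one) to 55 characters; otherwise '@55'
    #dies with "'@' outside of string in unpack". See explanation below.
    my ($order, $date) = unpack 'A41 @55 A*', sprintf('%-55s', $line);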
    push @{$dates_for{$order}}, $date;
}

close $ORDERSFILE;

say Dumper(\%dates_for);

my @dates;

for my $order (keys %dates_for) {
    @dates = @{$dates_for{$order}};

    if (@dates > 1) {
        my $dup_count = @dates;
        say "($dup_count) $order";
        say "   $_" for @dates;
    }
}

--output:--
$VAR1 = {
          ' Buy NFLX     50 @  317.50  Reg-Acct OPEN' => [
                                                           '01/13/15'
                                                         ],
          'Sell NFLX     50 @  362.00  Reg-Acct OPEN' => [
                                                           '11/25/14'
                                                         ],
          'Sell NFLX     50 @  345.00  Reg-Acct OPEN' => [
                                                           '01/05/15',
                                                           '01/06/15'
                                                         ],
          ' Buy NFLX     50 @  315.00  Reg-Acct Fake' => [
                                                           ''
                                                         ],
          'Sell NFLX     50 @  345.00  IRA-Acct OPEN' => [
                                                           '09/15/14'
                                                         ]
        };

(2) Sell NFLX     50 @  345.00  Reg-Acct OPEN
   01/05/15
   01/06/15

A41 @55 A* => extract 41 characters (A41),
..............jump to position 55 (@55),
..............extract the remaining characters (A*)

You can jump to any position you want, forward or backward, which means you can extract the pieces in any order you like.
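To illustrate the backward jump, here is a minimal sketch that extracts the date first and the order second by jumping to offset 55 and then back to offset 0. The helper name `date_then_order` is made up for this example. It pads short lines first, on the assumption that some lines (such as the 'Fake' one) may be shorter than 55 characters, because an absolute position past the end of the string makes unpack die.

```perl
use strict;
use warnings;

# Hypothetical helper: return (date, order) from one fixed-width order line.
sub date_then_order {
    my ($line) = @_;
    chomp $line;

    # Pad short lines so the absolute jump '@55' stays inside the string;
    # otherwise unpack dies with "'@' outside of string in unpack".
    my $padded = sprintf '%-55s', $line;

    # Jump forward to offset 55 for the date, then back to offset 0
    # for the 41-character order.
    return unpack '@55 A* @0 A41', $padded;
}

my $line = 'Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/05/15';
my ($date, $order) = date_then_order($line);
print "$date | $order\n";    # the date comes out first, then the order text
```

The template lists the pieces in the order they should be returned, not in the order they appear in the line, so no reshuffling is needed after the call.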