Checking whether the next line is a duplicate using Perl

Question

I'm trying to read data from a fairly large file. I need to read through the file line by line and report any duplicate records that start with G.

THIS IS THE DATA:
E123456789
G123456789
h12345
E1234567
E7899874
G123456798
G123465798
h1245

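This is only sample data; the real file has about 6000 lines of messy data. The records that matter are the ones starting with E, G or h.

This is my code so far: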

#!/usr/bin/perl

use strict;
use warnings;

my $infile  = $ARGV[0];
my $found_E = 0;
my $sets    = 0;

open my $ifh, '<', $infile;
while (<$ifh>) {

  if (/^E/) {
    $found_E = 1;
    next;
  }

  if ($found_E) {

    if (/^G/) {
      $sets += 1;
      $found_E = 0;
      next;
    }

    if (/^h/) {
      print "Error! No G Record at line  $.\n";
      exit;
    }
  }
}
close($ifh);

printf "Found %d sets of Enrichment data with G Records \n", $sets;

my @lines;
my %duplicates;
open $ifh, '<', $infile;
while (<$ifh>) {
  @lines = split('', $_);
  if ($lines[0] eq 'G') {
    print if !defined $duplicates{$_};
    $duplicates{$_}++;
  }
}
close($ifh);

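As you can see, I'm checking that a G record occurs only after an E record and before an h record. The second loop is meant to find duplicates, but at the moment it just prints all the G records.

Also, could someone suggest how to handle reporting when there are no E records in the file?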

Answer 1

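Grouped duplicate check

If you only want to check for duplicates that are grouped together, that's easy. You can check whether the current line is the same as the previous one: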

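my $line;    # the previous line read; undef until the first line has been seen

while (<$ifh>) {
    # skip this line if it is identical to the one just before it
    next if defined $line && $line eq $_;
    $line = $_;
    ...
}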
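
The snippet above skips grouped duplicates; since the question asks to report them, here is a minimal runnable sketch of the same previous-line comparison that collects the positions instead. The name adjacent_duplicates and the sample records are assumptions made for illustration, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 1-based positions of every line that is
# identical to the line immediately before it.
sub adjacent_duplicates {
    my @lines = @_;
    my @dup_positions;
    my $previous;    # undef until the first line has been processed
    for my $i ( 0 .. $#lines ) {
        push @dup_positions, $i + 1
            if defined $previous && $lines[$i] eq $previous;
        $previous = $lines[$i];
    }
    return @dup_positions;
}

# Made-up sample: line 2 repeats line 1; line 4 repeats line 1 too,
# but it is not adjacent to it, so this check does not flag it.
my @sample = ( "G123456789\n", "G123456789\n", "h12345\n", "G123456789\n" );
print "Adjacent duplicate at line $_\n" for adjacent_duplicates(@sample);
```

Only line 2 is reported here, which shows the limitation of the grouped check: a repeat that is separated from its original by other records goes unnoticed.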

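Checking all duplicates

If you want to check for every duplicated line in the file, regardless of where it appears, you'll have to do something like this: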

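my %seen;    # lines already encountered anywhere in the file

while (<$ifh>) {
    # skip any line that has appeared before, wherever it was
    next if exists $seen{$_};
    $seen{$_} = 1;
    ...
}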

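This can get expensive for large files, because every distinct line has to be kept in memory as a hash key (the lookups themselves are fast), but it is the best option if you don't want to modify the source file.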
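
As a minimal sketch of how the hash approach could report duplicates rather than skip them, the following remembers the line number at which each G record first appeared. The name report_duplicate_g_records, the message wording and the in-memory sample are assumptions for illustration; restricting the check to G records follows the question, not this answer's snippet.

```perl
use strict;
use warnings;

# Hypothetical helper: reads records from a filehandle and returns one message
# per G record that repeats an earlier G record anywhere in the input.
sub report_duplicate_g_records {
    my ($fh) = @_;
    my %first_seen;    # record text => line number of its first occurrence
    my @messages;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless $line =~ /^G/;    # only G records are of interest
        if ( exists $first_seen{$line} ) {
            push @messages,
                "Duplicate G record '$line' at line $. (first seen at line $first_seen{$line})";
        }
        else {
            $first_seen{$line} = $.;
        }
    }
    return @messages;
}

# Demo on an in-memory file so the sketch runs without any input file.
my $data = "E123456789\nG123456789\nh12345\nE1234567\nG123456789\nh1245\n";
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";
print "$_\n" for report_duplicate_g_records($fh);
close $fh;
```

With a real file, the same sub could be handed a filehandle opened on $ARGV[0], as in the question's code.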

Answer 2

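my %seen_G;    # G records that have already been printed

# $found_E is the flag declared in the question's code
while (<$ifh>) {
    my $c = substr( $_, 0, 1 );    # record type: first character of the line
    if ($found_E) {
        # the previous line was an E record, so an h record here means the G is missing
        die "Error! No G Record at line  $." if $c eq 'h';
        # print each distinct G record only the first time it is seen
        print if ( $c eq 'G' and not $seen_G{$_}++ );
    }
    # remember whether this line was an E record for the next iteration
    $found_E = ( $c eq 'E' );
}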

Answer 3

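It's not clear whether you want to skip lines that duplicate the immediately preceding line, or lines that duplicate any earlier line.

Skipping lines that duplicate the previous line

If the line just read is the same as the previous one, simply fetch another line.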

my $last;
while (<>) {
   next if /^G/ && defined($last) && $_ eq $last;
   $last = $_;
   ...
}

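I'll leave it to you to decide when you want to look for duplicates, but I think you'll want to add a check of $found_G to the if condition.

Skipping lines that duplicate any earlier line

Maintain a collection of the lines you have already seen. Using a hash allows fast insertion and lookup.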

my %seen;
while (<>) {
   next if /^G/ && $seen{$_}++;
   ...
}
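
The $seen{$_}++ test works because post-increment returns the old count: it is undefined (false) the first time a line is seen and a true number on every repeat. As a rough sketch, the same counts can also be used to list the repeated G records once reading is done; the name count_repeated_g_records and the sample records are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: counts how often each G record occurs and returns
# only the records that occur more than once, together with their counts.
sub count_repeated_g_records {
    my @lines = @_;
    my %seen;    # record text => number of occurrences
    for my $line (@lines) {
        chomp( my $record = $line );
        next unless $record =~ /^G/;
        $seen{$record}++;    # undef counts as 0, so the first hit becomes 1
    }
    my %repeated;
    for my $record ( keys %seen ) {
        $repeated{$record} = $seen{$record} if $seen{$record} > 1;
    }
    return %repeated;
}

# Made-up sample: only G123456789 appears twice.
my %repeated = count_repeated_g_records(
    "G123456789\n", "h12345\n", "G123456798\n", "G123456789\n",
);
print "$_ occurs $repeated{$_} times\n" for sort keys %repeated;
```

Collecting counts first and reporting afterwards gives one line per repeated record, at the cost of not knowing where in the file each repeat occurred.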
