Question

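I'm new to this site and need help removing duplicate entries from multiple text files (in a loop). I tried the code below, but it does not remove duplicates from multiple files; it only works for a single file.

Code: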

my $file = "$Log_dir/File_listing.txt";
my $outfile  = "$Log_dir/Remove_duplicate.txt";; 

open (IN, "<$file") or die "Couldn't open input file: $!"; 
open (OUT, ">$outfile") or die "Couldn't open output file: $!"; 
my %seen = ();
{
  my @ARGV = ($file);
  # local $^I = '.bac';
  while(<IN>){
    print OUT $seen{$_}++;
    next if $seen{$_} > 1;
    print OUT ;
  }
}

Thanks, Art

Answer 1

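Errors in your script:

You overwrite @ARGV (in fact a new lexical copy of it, because of my) with $file, so it will never hold any other file arguments.
...not that it matters, because you open the file handles before you even assign to @ARGV, and you never loop over the arguments. You just have a block { ... } around the code, which serves no purpose.
%seen will contain deduplication data for every file you open, unless you reset it.
You print the count $seen{$_} to the output file, which I am sure you do not want.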

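You could use the diamond operator's implicit open of the @ARGV arguments, but since you (probably) need to assign a proper output file name for each new file, that is an unnecessary complication for this solution.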
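To show what that complication looks like, here is a minimal sketch of the diamond-operator variant. It is not part of the original answer: the sub name dedupe_with_diamond and the ".deduped" suffix are assumptions chosen for illustration. The output handle has to be reopened whenever the input file changes, which is detected with eof (no parentheses).

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): dedupe each named file into
# "<name>.deduped" while reading through the diamond operator.
sub dedupe_with_diamond {
    my @files = @_;
    return unless @files;          # avoid falling back to STDIN
    local @ARGV = @files;          # <> reads the files listed in @ARGV
    my (%seen, $outfh);
    while (my $line = <>) {
        if (!defined $outfh) {     # first line of a new input file
            open $outfh, ">", "$ARGV.deduped" or die "$ARGV.deduped: $!";
        }
        print $outfh $line if !$seen{$line}++;
        if (eof) {                 # end of the current file, not of all input
            close $outfh or die "$ARGV.deduped: $!";
            undef $outfh;
            %seen = ();            # reset so each file is deduped on its own
        }
    }
    return;
}

dedupe_with_diamond(@ARGV);
```

Note that an empty input file produces no output file in this sketch, because the loop body never runs for it. An explicit loop over the file names avoids this bookkeeping: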

use strict;
use warnings;                      # always use these

for my $file (@ARGV) {             # loop over all file names
    my $out = "$file.deduped";     # create output file name
    open my $infh,  "<", $file or die "$file: $!";
    open my $outfh, ">", $out  or die "$out: $!";
    my %seen;
    while (<$infh>) {
        print $outfh $_ if !$seen{$_}++;   # print if a line is never seen before
    }
}

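Note that declaring %seen as a lexical variable inside the loop makes the script check for duplicates within each individual file. If you move the variable outside the for loop, duplicates will be checked across all files. I'm not sure which you prefer.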
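As a sketch of that second option, the version below declares %seen once, before the loop, so a line is kept only the first time it appears in any of the files. The sub name dedupe_across_files is an assumption made for illustration; the rest mirrors the loop above.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): write "<name>.deduped" for
# each file, dropping any line already seen in this or an earlier file.
sub dedupe_across_files {
    my @files = @_;
    my %seen;                              # declared once, shared by all files
    for my $file (@files) {
        my $out = "$file.deduped";
        open my $infh,  "<", $file or die "$file: $!";
        open my $outfh, ">", $out  or die "$out: $!";
        while (my $line = <$infh>) {
            print $outfh $line if !$seen{$line}++;
        }
        close $outfh or die "$out: $!";
    }
    return;
}

dedupe_across_files(@ARGV);
```

With this variant the order of the file arguments matters: a line that occurs in several files survives only in the output of the first file that contains it.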

Answer 2

I assume your File_listing.txt contains multiple lines, some of which occur more than once? If that is the case, just use the bash shell:

sort --unique <File_listing.txt >Remove_duplicate.txt

Or, if you prefer Perl:

perl -lne '$seen{$_}++ and next or print;' <File_listing.txt >Remove_duplicate.txt

Remove duplicate entries from multiple text files in Perl?
