Question

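I am new to Perl. I have eight text files, each more than five thousand lines long. I want to write a Perl script that finds the entries (records) that appear in the first five files but not in the last three. Say the files are (A, B, C, D, E, F, G, H): I want the entries that are in A to E but not in F to H.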

Can anyone suggest how to write code for this?

Answer 1

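If I understand correctly, you need to:

Make a list of all the items in A-E (call it list 1)
Make another list of the items in F-H (list 2)
Find all the items in list 1 that are not in list 2.

Rather than two lists, use two hashes.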

use strict;
use warnings;

# Two sets of files to be compared.
my @Set1 = qw(A B C D E);
my @Set2 = qw(F G H);

# Get all the items out of each set into hash references
my $items_in_set1 = get_items(@Set1);
my $items_in_set2 = get_items(@Set2);

my %unique_to_set1;
for my $item (keys %$items_in_set1) {
    # If an item in set 1 isn't in set 2, remember it.
    $unique_to_set1{$item}++ if !$items_in_set2->{$item};
}

# Print them out, one per line, in sorted order
print "$_\n" for sort keys %unique_to_set1;

sub get_items {
    my @files = @_;

    my %items;
    for my $file (@files) {
        open my $fh, "<", $file or die "Can't open $file: $!";
        while( my $item = <$fh>) {
            chomp $item;
            $items{$item}++;
        }
    }

    return \%items;
}

If this is a one-off, you can do it in the shell.

cat A B C D E | sort | uniq > set1
cat F G H | sort | uniq > set2
comm -23 set1 set2

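cat A B C D E concatenates the files into a single stream. That is piped to sort, and then uniq removes the duplicates (uniq only works properly when the lines are sorted). The result is written to the file set1. The same is done for the second set. Finally, comm is run on the two set files to compare them; the -23 option suppresses the lines unique to set2 and the lines common to both, so only the lines unique to set1 are shown.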
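To see the pipeline behave on known input, here is a small sketch that can be run as-is. The file contents are invented for illustration, the scratch directory comes from mktemp -d (assumed to be available, as it is on most Linux and macOS systems), and sort -u stands in for sort | uniq, which should give the same result.

```shell
#!/bin/sh
# Sketch only: the sample data below is invented for illustration.
export LC_ALL=C              # keep sort and comm on the same collation order

workdir=$(mktemp -d)         # scratch directory so real files are untouched

# Tiny stand-ins for the eight files A..H
printf 'apple\nbanana\n'     > "$workdir/A"
printf 'banana\ncherry\n'    > "$workdir/B"
printf 'cherry\n'            > "$workdir/C"
printf 'date\n'              > "$workdir/D"
printf 'apple\nelderberry\n' > "$workdir/E"
printf 'banana\n'            > "$workdir/F"
printf 'date\nfig\n'         > "$workdir/G"
printf 'grape\n'             > "$workdir/H"

# sort -u sorts and drops duplicates in one step (same result as sort | uniq)
( cd "$workdir" && sort -u A B C D E > set1 && sort -u F G H > set2 )

# -2 hides lines found only in set2, -3 hides lines found in both;
# what remains is the lines found only in set1
result=$(comm -23 "$workdir/set1" "$workdir/set2")
printf '%s\n' "$result"
# prints:
# apple
# cherry
# elderberry

rm -rf "$workdir"
```

Here banana and date are dropped because they also occur in F or G. Setting LC_ALL=C is a precaution: comm expects its inputs sorted in the same collation order that it uses itself.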

Finding common entries in different text files
