Question

I wrote a Perl script that reads a file containing some numbers, one per line. I want to remove the duplicates and save the new list to a file. Here is my script:

use strict;

my $arg = "<abs path to>\\list.txt";
open (FH, "$arg") or die "\nError trying to open the file $arg : $!";
print "Opened File : $arg\n";
my $line = "";
my @lines = <FH>;
close FH;
my $temp;
my $count = 0;
my $check = 0;
my @list;
my $flag;

for $line (@lines)
{
    $count += 1;
    $check = $count;
    $flag = 1;
    for my $next (@lines)
    {
        $check -= 1;
        if($check < 0)
        {
            if ($line == $next)
            {
                $flag = 0;
            }
        }
    }

    if($flag == 1)
    {
        push (@list, $line);
    }
}

my $newarg = "<abs path to>\\new_list.txt";
open (FWH, ">>$newarg") or die "\nError trying to open the file $newarg for writing : $!";
my $size = @list;
print FWH "\n\n*** Size = $size ***\n\n";
for my $line (@list)
{
    print FWH "$line";
}

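I'm a C++ person trying to learn Perl, so please suggest some Perl APIs that could reduce the size of the script. I want the script to stay readable and easy to understand, hence the spacing. Thanks.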

Answer 1

There isn't much to add to your coding style; just read the comments:

my $arg = "<abs path to>\\list.txt";

# Use lexical file handles and 3 argument form of open:
open my $FH, '<', $arg or die "\nError trying to open the file $arg : $!";
print "Opened File : $arg\n";

my @lines = <$FH>;
close $FH;

# Define each variable in the tightest scope possible.
my $count = 0;
my @list;

for my $line (@lines)
{
    $count += 1;
    my $check = $count;
    my $flag = 1;
    for my $next (@lines)
    {
        $check -= 1;
        if($check < 0)
        {
            if ($line == $next)
            {
                $flag = 0;
            }
        }
    }

    if ($flag == 1)
    {
        push @list, $line;
    }
}

my $newarg = "<abs path to>\\new_list.txt";
open my $FWH, '>>', $newarg or die "\nError trying to open the file $newarg for writing : $!";
my $size = @list;
print $FWH "\n\n*** Size = $size ***\n\n";
for my $line (@list)
{
    # Double quotes not needed if there is nothing to interpolate.
    print $FWH $line;
}
# You forgot to close the file. For output files, this is important.
close $FWH or die "\nCannot close $newarg: $!";

However, this is how I would implement the algorithm:

#!/usr/bin/perl
use warnings;
use strict;

my $input_file  = 'PATH/TO/FILE.TXT';
my $output_file = "$input_file.out";

open my $IN,  '<', $input_file  or die "Cannot open $input_file: $!\n";
open my $OUT, '>', $output_file or die "Cannot open $output_file: $!\n";

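# Note: a line is skipped only when it equals the line directly before it,
# so this assumes duplicates are adjacent (for example, sorted input).
my $previous = 'inf';
while (my $line = <$IN>) {
    print $OUT $line if $previous != $line;
    $previous = $line;
}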

close $OUT;

Answer 2

So you have a file of numbers and you want to remove the duplicates from it while preserving the order? That's a one-liner in Perl.

perl -ne 'print unless $seen{$_}++' file > newfile

Or:

# saves original in file.bak
perl -i.bak -ne 'print unless $seen{$_}++' file

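If your lines contain something other than a single number, or you want to print some statistics, or you want nicer argument handling, or you notice that this doesn't dedupe numbers that differ only in whitespace, then go ahead and change it as appropriate. For example: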

# whitespace/non-numbers tolerant
perl -i.bak -ne 'if (/^\s*(\d+)\s*$/) { print unless $seen{$1}++ } else { print }' file

As a script, the key logic is exactly the same:

#! /usr/bin/env perl
use common::sense;
use autodie;

my $silent;
$silent = shift if (@ARGV > 0 and $ARGV[0] eq '-s');
die "usage: $0 [-s] src dest\n" unless @ARGV == 2;

open my $fi, '<', shift;
open my $fo, '>', shift;

my %seen;
while (<$fi>) {
  if (/^\s* (\d+) \s*$/x) {
    print {$fo} $_ unless $seen{$1}++;
    next;
  }
  print {$fo} $_;
}

unless ($silent) {
  say '-- de-dup stats --';
  say '-- $count $number --';
  # Report how many times each number occurred, in numeric order.
  for (sort { $a <=> $b } keys %seen) {
    say "$seen{$_} $_";
  }
}

Edit: Heh, I didn't even consider the case where the duplicates are all adjacent. No hash is needed for that:

perl -ne 'print unless defined $last && $_ == $last; $last = $_' file > newfile
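
For illustration, here is a rough sketch of the same adjacent-duplicate idea applied to an in-memory list. The helper name dedupe_adjacent and the sample data are made up, and it compares with eq (string equality) instead of ==, which is a swapped-in choice so that non-numeric lines don't trigger warnings:

```perl
use strict;
use warnings;

# Hypothetical helper: drops an item only when it equals the item
# directly before it. Duplicates that are not adjacent survive.
sub dedupe_adjacent {
    my @out;
    for my $line (@_) {
        # Keep the item unless it repeats the last item we kept.
        push @out, $line unless @out && $out[-1] eq $line;
    }
    return @out;
}

# Made-up sample data: the trailing 1 is not adjacent to the first run.
print join(' ', dedupe_adjacent(1, 1, 2, 2, 1)), "\n";   # prints "1 2 1"
```

So this approach is only a full de-duplication when the input is sorted or otherwise grouped.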

Answer 3

Whenever you have to keep track of something, think hash. Hashes have a couple of very nice properties:

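Only one of each key can exist: Imagine you store all your numbers in a hash keyed by the number. The list of keys contains all your numbers, with no duplicates.
Fast key lookup: Imagine you've stored the numbers in a hash, again keyed by the number. Have you seen this number before? Just check whether that key exists. Quick and easy.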
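
As a minimal sketch of those two properties working together (the helper name dedupe_keep_order and the sample numbers are invented for the example):

```perl
use strict;
use warnings;

# Hypothetical helper: returns the input values with duplicates removed,
# keeping the first occurrence of each value in its original position.
sub dedupe_keep_order {
    my @values = @_;
    my %seen;                           # keys are the values seen so far
    my @unique;
    for my $value (@values) {
        next if exists $seen{$value};   # fast lookup: seen before?
        $seen{$value} = 1;              # a key can exist only once
        push @unique, $value;
    }
    return @unique;
}

my @numbers = (3, 1, 3, 2, 1);          # made-up sample data
print join(' ', dedupe_keep_order(@numbers)), "\n";   # prints "3 1 2"
```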

Here's a quick rework.

#! /usr/bin/env perl
use strict;
use feature qw(say);
use warnings;
use autodie;

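Note that I have use warnings as well as use strict. I tell people that use strict catches about 90% of errors. Well, use warnings catches another 9.99%. Warnings cover things like trying to print an undefined variable, or questionable syntax that could get you into trouble.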

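use feature qw(say); lets you use say instead of print. With say, the newline is included, so you don't have to keep adding \n. It doesn't sound like much, but it's nice. use autodie automatically kills your program when something like opening a file fails. It turns Perl into an exception-based language. That way, if you forget to test something, your program will let you know.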
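
To make the "exception-based" part concrete, here is a small sketch of catching the exception that autodie throws, using a plain eval block. The helper name can_open and the path are invented for the example:

```perl
use strict;
use warnings;
use autodie;

# Hypothetical helper: returns 1 if the file could be opened for reading,
# 0 if autodie threw an exception instead.
sub can_open {
    my ($path) = @_;
    my $ok = eval {
        open my $fh, '<', $path;   # no "or die": autodie throws on failure
        close $fh;
        1;                         # reached only if nothing threw
    };
    return $ok ? 1 : 0;
}

# Made-up path that is assumed not to exist on your system.
print can_open('/no/such/dir/no_such_file.txt')
    ? "opened\n"
    : "could not open\n";
```

Without the eval, the failed open would simply end the program with an error message, which is usually what you want in a short script.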

use constant {
    FILE         => '/path/to/file',
    OUTPUT       => '/path/to/output/file',
};

When you need something that doesn't change, you should use constants.
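
One thing that tends to surprise people coming from C++: constants created with use constant behave like subroutines, so they are not interpolated inside double-quoted strings. A short sketch (the variable names here are just for illustration):

```perl
use strict;
use warnings;
use constant {
    FILE => '/path/to/file',
};

# Inside double quotes the constant name is just literal text.
my $wrong = "Reading FILE";            # "Reading FILE"

# Concatenation works as expected.
my $right = "Reading " . FILE;         # "Reading /path/to/file"

# The @{[ ... ]} idiom evaluates an expression inside a string.
my $also  = "Reading @{[ FILE ]}";     # "Reading /path/to/file"

print "$wrong\n$right\n$also\n";
```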

open my $numfile_fh, "<", FILE;  #No need for die
open my $output_fh, ">", OUTPUT;
my %number_hash;
while ( my $number = <$numfile_fh> ) {
    chomp $number;   #Always chomp after you read
    if ( not exists $number_hash{$number} ) {
        $number_hash{$number} = 1;
        say $output_fh "$number";
    }
}
close $numfile_fh;
close $output_fh;

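I read one number at a time, but instead of simply writing it to the file, I check my %number_hash to see whether I've already seen that number. If I haven't, I store it in %number_hash and print it. The logic could also be written like this: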

while ( my $number = <$numfile_fh> ) {
    chomp $number;   #Always chomp after you read
    next if exists $number_hash{$number};

    $number_hash{$number} = 1;
    say $output_fh "$number";
}

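Some would say this is a better way to write the loop logic. In this style, you first eliminate the exceptions (duplicate numbers), then handle the default case (print the number you read and save it in the hash).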

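Note that none of this changes the order of the list. You read a number and, as long as it isn't a duplicate, you print it in the order you read it. If you want the numbers sorted instead, use two loops: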

while ( my $number = <$numfile_fh> ) {
    chomp $number;   #Always chomp after you read
    $number_hash{$number} = 1;
}

# Sort numerically; a bare "sort" compares strings, so 10 would come before 9.
for my $number ( sort { $a <=> $b } keys %number_hash ) {
    say $output_fh "$number";
}

Note that I don't bother testing whether the number has already been stored. There's no need, because a hash can hold each key only once.
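
One detail worth checking when sorting hash keys that are numbers: a bare sort compares them as strings. A small sketch of the difference, with an invented helper name sorted_unique and made-up data:

```perl
use strict;
use warnings;

# Hypothetical helper: unique values in ascending numeric order.
# Call it in list context.
sub sorted_unique {
    my %number_hash = map { $_ => 1 } @_;   # duplicates collapse into one key
    return sort { $a <=> $b } keys %number_hash;
}

my @numbers = (10, 9, 100, 9);              # made-up sample data

# Default sort is a string sort.
my %tmp = map { $_ => 1 } @numbers;
print join(' ', sort keys %tmp), "\n";              # prints "10 100 9"

# Numeric comparison gives the order most people expect.
print join(' ', sorted_unique(@numbers)), "\n";     # prints "9 10 100"
```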

Answer 4

Why not simply use another tool such as awk:

awk '!_[$0]++' your_file

Perl also has a utility for getting the unique elements of an array:

use List::MoreUtils qw/ uniq /;
my @unique = uniq @lines;

If you'd rather not use the utility above, you can do it this way:

my %seen;
my @unique = grep { ! $seen{$_}++ } @lines;

Or you can use the function below to get the unique elements:

sub uniq {
    return keys %{{ map { $_ => 1 } @_ }};
}

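Call it like this: uniq(@myarray); note that because it returns hash keys, the original order of the elements is not preserved.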
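
If the order of the lines matters, the grep idiom can be wrapped in a function instead. This is only a sketch; the name uniq_ordered is made up to avoid clashing with the uniq from List::MoreUtils:

```perl
use strict;
use warnings;

# Hypothetical helper: unique elements, first occurrence wins,
# original order kept.
sub uniq_ordered {
    my %seen;
    # $seen{$_}++ is 0 (false) the first time a value appears,
    # so grep keeps only first occurrences.
    return grep { !$seen{$_}++ } @_;
}

my @myarray = qw(b a b c a);                     # made-up sample data
print join(' ', uniq_ordered(@myarray)), "\n";   # prints "b a c"
```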

Suggestions for improving a Perl script?
