Question

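I have to write a Perl program (for practice) that checks a text file for identical words and then prints the text to a new file without the duplicates.

Could somebody please help me? I know that I can look for words with the m// operator, but how do I look for words that I don't know in advance? For example, if the text file contains:

Hello, Hello, how are you?

I would want to copy this file to a new file without one of the 'Hello's. Of course, I wouldn't know beforehand whether the file contains any duplicate words; the whole idea is that the program searches for them.

I have a basic script that sorts the lines alphabetically, but step 2, finding the duplicate words, is something I can't figure out. Here is the script (hopefully correct so far):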

#!/usr/bin/perl 
use strict;
use warnings;

my $source = shift(@ARGV);
my $cible = shift(@ARGV);

open (SOURCE, '<', $source) or die ("Can't open $source\n");
open (CIBLE, '>', $cible) or die ("Can't open $cible\n");

my @lignes = <SOURCE>;
my @lignes_sorted = sort (@lignes);

print CIBLE @lignes_sorted;

chomp @lignes;
chomp @lignes_sorted;

print "Original text : @lignes\n";

sleep (1);

print "Sorted text : @lignes_sorted\n"; 

close(SOURCE);
close (CIBLE);

Answer 1

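Removing duplicate words from a sentence is more complicated than it sounds. For example, if you split the sentence on whitespace, you get "words" that contain non-word characters, such as Hello,, which count as non-duplicates of the real word Hello. There are many variables to take into account, but assuming the simplest case, where every run of non-whitespace characters is a legitimate word, you can do this: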

$ perl -anlwe '@F=grep !$seen{$_}++, @F; print "@F";' hello.txt
Hello, how are you?
yada Yada this is test material dupe Dupe

$ cat hello.txt
Hello, Hello, how are you?
yada Yada this is test material dupe dupe Dupe

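As you can see, it does not consider yada and Yada to be duplicates. Nor does it consider Hello to be a duplicate of Hello,. You can tweak this by using lc or uc to remove the case dependency, and by allowing delimiters other than whitespace.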
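As a rough illustration of that tweak, here is a minimal sketch that compares words case-insensitively and ignores leading and trailing punctuation. The helper name dedupe_line and the exact normalization rule are assumptions made for this example, not part of the one-liner above; the first spelling encountered is the one that is kept.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: drop repeated words from one line. Words are compared
# in lower case with leading/trailing non-word characters stripped, so
# "Hello," and "hello" count as the same word.
sub dedupe_line {
    my ($line, $seen) = @_;
    my @kept;
    for my $word (split ' ', $line) {
        (my $key = lc $word) =~ s/^\W+|\W+$//g;   # "Hello," -> "hello"
        # Pure punctuation has an empty key and is always kept.
        next if $key ne '' && $seen->{$key}++;
        push @kept, $word;
    }
    return join ' ', @kept;
}

my %seen;   # shared across lines, like %seen in the one-liner
for my $line ('Hello, Hello, how are you?',
              'yada Yada this is test material dupe dupe Dupe') {
    my $out = dedupe_line($line, \%seen);
    print "$out\n";
}
```

With this normalization the second sample line loses both Yada and Dupe, which may or may not be what you want.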

What we do here is use a hash, %seen, to keep track of words that have appeared before. The basic program is:

my %seen;
while (<>) {                       # reading input file or stdin
    my @F = split;                 # splitting $_ on whitespace by default
    @F = grep !$seen{$_}++, @F;    # remove duplicates
    print "@F\n";                  # print array elements space-separated
                                   # (the one-liner's -l switch adds the newline)
}

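The effect of !$seen{$_}++ is that the first time a new key is encountered, the expression returns true, and every time after that it returns false. How does it work? These are the steps that take place: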

$seen{$_}     # value for key $_ is fetched
$seen{$_}++   # value for key $_ is incremented, undef -> 1
              # $foo++ returns the value *before* it is incremented, 
              # so it returns undef
!$seen{$_}++  # this is now "! undef", meaning "not false", as in true.

Values of 1 and above are all true, so the not operator negates them all to false.
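To watch this happen, here is a small sketch that runs the same expression over a made-up word list (the list and the output format are only for illustration) and reports what it returns each time:

```perl
use strict;
use warnings;

my %seen;
for my $word (qw(dupe dupe Dupe dupe)) {
    # Post-increment returns the old value (undef the first time),
    # and ! turns that into true only on the first sighting.
    my $is_new = !$seen{$word}++;
    printf "%-5s first time? %s (count now %d)\n",
        $word, ($is_new ? 'yes' : 'no'), $seen{$word};
}
```

Only the first dupe and the first Dupe are reported as new, since hash keys are case-sensitive.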

Answer 2

In Perl:

#!/usr/bin/perl -w
use strict;

my $source = shift(@ARGV);
my $cible = shift(@ARGV);

open (SOURCE, '<', $source) or die ("Can't open $source\n");
open (CIBLE, '>', $cible) or die ("Can't open $cible\n");

my @input = sort <SOURCE>;
my %words = ();
foreach (@input) {
    foreach my $word (split(/\s/)) {
        print CIBLE $word." " unless ( exists $words{$word} );
        $words{$word} = 1;
    }
}

close(SOURCE);
close (CIBLE);

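The basic idea is to split the whole text into single words (using the split function) and then build a hash with each word as a key. When reading the next word, just check whether that word is already in the hash. If it is, it is a duplicate.

For the string Hello, Hello, how are you? it prints: Hello, how are you?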

Answer 3

If you aren't worried about finding duplicate words that differ in case, you can do this with a single substitution.

use strict;
use warnings;

my ($source, $cible) = @ARGV;

my $data;
{
    open (my $source_fh, '<', $source) or die ("Can't open $source\n");
    local $/;
    $data = <$source_fh>;
}

$data =~ s/\b(\w+)\W+(?=\1\b)//g;

open (my $cible_fh, '>', $cible) or die ("Can't open $cible\n");
print $cible_fh $data;
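One thing worth knowing about this substitution: because of the lookahead, it only removes a word when the very next word is identical, so repeats that are separated by other words are left alone. A minimal sketch showing both cases, with the regex wrapped in a helper whose name (drop_adjacent_repeats) is made up for this example:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the answer above.
sub drop_adjacent_repeats {
    my ($text) = @_;
    # Remove a word plus the separators after it when the same word follows.
    $text =~ s/\b(\w+)\W+(?=\1\b)//g;
    return $text;
}

for my $sample ('Hello, Hello, how are you?', 'dupe this dupe') {
    my $result = drop_adjacent_repeats($sample);
    print "$sample  =>  $result\n";
}
```

The first sample loses one Hello; the second comes back unchanged because the two dupe words are not next to each other.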

Answer 4

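I don't know how to do it in Perl, but it can be done easily with sed and a few Unix utilities. The algorithm would be:

Separate all the words by replacing whitespace with newlines
Sort the words
Send the sorted word list through uniq with the -c option (count of occurrences)
Delete all words that occur only once (a count of 1 in the first column)

The command then becomes (replace \t with a TAB and \n with ENTER):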

sed 's/[ \t,.][ \t,.]*/\n/g' filename | sort | uniq -c | sed '/^  *\<1\>/d'
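Since the question asks for Perl, the same counting idea could be sketched roughly as follows. This is an illustration under assumptions, not a drop-in equivalent of the pipeline: the helper name repeated_words is invented, it splits on whitespace, commas and periods, and it returns only the repeated words without their counts.

```perl
use strict;
use warnings;

# Hypothetical helper: return the words that occur more than once, sorted.
sub repeated_words {
    my ($text) = @_;
    my %count;
    # Split on runs of whitespace, commas and periods; skip empty fields.
    for my $word (split /[\s,.]+/, $text) {
        next if $word eq '';
        $count{$word}++;
    }
    my @repeated = grep { $count{$_} > 1 } sort keys %count;
    return @repeated;
}

my $text = "Hello, Hello, how are you?\nyada Yada dupe dupe Dupe\n";
my @words = repeated_words($text);
print "$_\n" for @words;
```

Like the sed version, this treats yada and Yada as different words.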

Hope that helps.

Looking for double words