Question

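I have two input files: StopWordsList.txt, which contains a list of stop words, and text1.txt, which contains text. I want to remove from text1.txt the words that appear in StopWordsList.txt. Here is my code: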

my $FichierResulat = '/home/lenovo/Bureau/MesTravaux/LeskAlgo/OriginalLeskResult';

open( my $FhResultat, '>:utf8', $FichierResulat );

open( my $fh1, "<:utf8", '/home/lenovo/Bureau/MesTravaux/LeskAlgo/DemoLesk/StopWordsList.txt' ) 
        or die "Failed to open file: $!\n"; #file contains stop words

open( my $fh2, "<:utf8", '/home/lenovo/Bureau/MesTravaux/LeskAlgo/text1.txt' ) #file contains text
        or die "Failed to open file: $!\n";

my @tabStopWords = <$fh1>;

my @tab_contexte;
my @words;

while ( <$fh2> ) {
    chomp;
    next if m/^$/;
    my $context = $_;
    @words = split( / /, $_ );
}
#compare: remove from @words the words existing in @tabStopWords
my %temp;

@temp{@tabStopWords} = 0 .. $#tabStopWords;

for my $val ( @words ) {

    if ( exists $temp{$val} ) {
        print "$val est présent dans tab1 à la position $temp{$val}.\n";
    }
    else {
        print "$val n'est pas dans tab1.\n";
        push @tab_sans_SW, $val;
    }
}

foreach my $value ( @tab_sans_SW ) {
    print $FhResultat "$value\n";
}

But the result file contains every word from @words; the words that exist in @tabStopWords have not been removed. I hope you can help me.

My stop words file: ال الآن التي الذي الذين اللاتي اللائي اللتان اللتين

My text file: ومواصلاتبمافيهمنبريدونورومياهوصناعاتوعلومومعارفوحينمايركباحدناقطارافإنهيركبفينفسالوقتعلىحريةجاهزةاعدهالهآلافالعمالوالمخترعينوالمهندسينفي

Answer 1

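There are a couple of problems:

You never chomp the contents of @tabStopWords, so every entry still has a newline at the end
You overwrite the contents of @words on every pass through the while loop with @words = split(/ /, $_), instead of adding to it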
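Both problems can be reproduced without touching the real files. The following is a minimal sketch, assuming in-memory file handles and made-up sample data ($stop_data and $text_data are hypothetical stand-ins for the two input files):

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the two input files.
my $stop_data = "the\nof\n";
my $text_data = "the name\nof the rose\n";

# Problem 1: lines read from a file handle keep their trailing newline.
open my $stop_fh, '<', \$stop_data or die "Cannot open in-memory file: $!";
my @stop_lines = <$stop_fh>;
my %raw        = map { $_ => 1 } @stop_lines;
my $found_raw  = exists $raw{the} ? 1 : 0;      # the stored key is "the\n"

chomp @stop_lines;
my %clean       = map { $_ => 1 } @stop_lines;
my $found_clean = exists $clean{the} ? 1 : 0;   # the stored key is now "the"

# Problem 2: assigning inside the loop keeps only the last line's words,
# while push accumulates the words from every line.
open my $text_fh, '<', \$text_data or die "Cannot open in-memory file: $!";
my ( @overwritten, @accumulated );
while ( my $line = <$text_fh> ) {
    chomp $line;
    next if $line eq '';
    @overwritten = split / /, $line;
    push @accumulated, split / /, $line;
}

print "raw lookup: $found_raw, chomped lookup: $found_clean\n";
print "overwritten: @overwritten\n";
print "accumulated: @accumulated\n";
```

The lookup only succeeds once the newline is gone, and only the array filled with push still holds the words from the first line.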

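This program will do what you want. I have added use autodie to avoid having to check the result of every open, and I have removed some unused variables. Local variable names are better written in lower case with underscores, which is easier to read, especially for readers whose first language isn't English

I have used split on both files to reduce each of them to individual words. Because split also removes the newlines, there is no need for chomp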

use strict;
use warnings 'all';
use autodie;

use constant FICHIER_STOP_WORD => '/home/lenovo/Bureau/MesTravaux/LeskAlgo/DemoLesk/StopWordsList.txt';
use constant FICHIER_TEXTE     => '/home/lenovo/Bureau/MesTravaux/LeskAlgo/text1.txt';
use constant FICHIER_RESULAT   => '/home/lenovo/Bureau/MesTravaux/LeskAlgo/OriginalLeskResult';


my @tab_stop_words = do {
    open my $fh1, "<:utf8", FICHIER_STOP_WORD;
    map { split } <$fh1>;
};

my @words = do {
    open my $fh1, "<:utf8", FICHIER_TEXTE;
    map { split } <$fh1>;
};

# Map each stop word to its position in the stop word list
my %stop_words = map { $tab_stop_words[$_] => $_ } 0 .. $#tab_stop_words;

open my $fh_resultat, '>:utf8', FICHIER_RESULAT;

for my $word ( @words ) {

    my $position = $stop_words{$word};

    if ( defined $position ) {
        print "$word est présent dans tab1 à la position $position.\n";
    }
    else {
        print "$word n'est pas dans tab1.\n";
        print $fh_resultat "$word\n";
    }
}
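
The claim that split makes chomp unnecessary can be checked in isolation. This is a small sketch with hypothetical sample lines (@lines is made up for illustration); it also contrasts the argument-less form with the split / / used in the question:

```perl
use strict;
use warnings;

# Hypothetical lines as they would be read from a file, newlines included.
my @lines = ( "  alpha beta\n", "\n", "gamma\tdelta  \n" );

# split with no arguments splits $_ on runs of whitespace, skipping
# leading whitespace, so newlines and blank lines contribute nothing.
my @tokens = map { split } @lines;

# split / / splits on every single space instead, so it can return
# empty fields and leaves the newline attached to the last word.
my @naive = split / /, "alpha  beta\n";

print scalar(@tokens), " words: @tokens\n";
print scalar(@naive),  " fields from split / /\n";
```

With the argument-less form the blank line disappears and no token carries a newline, whereas split / / yields an empty field between the two spaces and a final field that still ends in a newline.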

Answer 2

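This question would be easier to answer if you had shown us the format of the two input files. Without that, this is going to be guesswork.

My guess is that your stop words file contains one word per line. In that case, every element of @tabStopWords, and therefore every key in %temp, will have a newline character at its end. That makes it very unlikely that any word from your source file will match one of those keys.

You probably want to add: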

chomp @tabStopWords;

to your code.
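
One way to confirm a guess like this is to dump an entry with control characters made visible. The sketch below assumes hypothetical entries in @stop_words and uses the core Data::Dumper module with its Useqq and Terse settings:

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;   # show control characters as escapes such as \n
$Data::Dumper::Terse = 1;   # print the bare value without the "$VAR1 = " prefix

# Hypothetical entries, as they would look straight after reading a file.
my @stop_words = ( "the\n", "of\n" );

my $before = Dumper( $stop_words[0] );
chomp @stop_words;          # chomp on an array strips the newline from every element
my $after = Dumper( $stop_words[0] );

print "before chomp: $before";
print "after chomp:  $after";
```

If the dump taken before chomp shows an escaped \n inside the quotes, the hash keys built from that array carry the same newline.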

Answer 3

We can use the smartmatch operator (~~) to get the difference:

my(@words_arr) = ("is","a");
my(@input_arr) = ("This","is","a","example","code");
my (@diff)  = grep { not $_ ~~ @words_arr} @input_arr;
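
Note that smartmatch has been flagged as experimental since Perl 5.18 and was deprecated in Perl 5.38, so the snippet above emits warnings on newer versions. A hash lookup gives the same difference without it; this is a sketch of that alternative on the same sample data, where %exclude is a name introduced here for illustration:

```perl
use strict;
use warnings;

my @words_arr = ( "is", "a" );
my @input_arr = ( "This", "is", "a", "example", "code" );

# Build a lookup hash from the words to exclude, then keep everything else.
my %exclude = map { $_ => 1 } @words_arr;
my @diff    = grep { !$exclude{$_} } @input_arr;

print "@diff\n";
```

The hash also scales better for a long stop word list, since each lookup no longer scans the whole array.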
