Question

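I'm sure this is simple, but I can't figure out what to do... I have a text file containing a bunch of words in a single column (let's call it "wordlist"). Then I have a large text file (let's call it "essay"). What I want to do is look up the words from my "wordlist" in the "essay" file. The trick is that I want to know where in the "essay" each matching word occurs (meaning: a match was found after X characters).

I can actually do this when looking for a single word (so the wordlist contains only one word), but I can't get it to work with a list of words... Any suggestions?

Thanks a lot

OK, so I just realized it tells me "no match found"... here's the code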

use strict;
use warnings;

open (my $wordlist, "<", "/wordlist.txt")
    or die "cannot open < wordlist.txt $!";

open (my $essay, "<", "/essay.txt")
    or die "cannot open < essay.txt $!";


while (<$essay>)    { print "match found\n" if ($essay =~ m/$wordlist/) ; }
            { print "no match found\n" if ($essay !~ m/$wordlist/) ; }

Please help...?

Answer 1

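Perl's index function does basic substring matching; it cannot guarantee that a complete word was matched. Regex-based matching is more useful here.

Explanation

Read the full text of the essay into a string => $essay
For each word in wordlist.txt => $_
- Repeatedly match $_ against $essay using a suitable regex; the one used here is \b$_\b
- For each match, collect $-[0]

\b: the word-boundary assertion ensures that only whole words match, not substrings.

@-: a special array holding the start offsets of the last successful regex match. $-[0] is the start of the whole match; with no capture groups in the pattern, it is the array's only element.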
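To see these two pieces in isolation, here is a minimal sketch. The sample string and the helper name word_positions are made up for illustration, and the \Q...\E quoting is an addition of this sketch rather than part of the answer's program:

```perl
use strict;
use warnings;

# Hypothetical helper: return the start offset of every whole-word
# occurrence of $word in $text.
sub word_positions {
    my ($text, $word) = @_;
    my @starts;
    # \Q...\E quotes regex metacharacters in $word (an assumption of this sketch)
    while ($text =~ /\b\Q$word\E\b/g) {
        push @starts, $-[0];    # $-[0] = start offset of the whole match
    }
    return @starts;
}

my $text = "hack the hacker, then hack again";

# "hack" inside "hacker" is skipped because no word boundary follows it
print "hack: @{[ word_positions($text, 'hack') ]}\n";
print "hacker: @{[ word_positions($text, 'hacker') ]}\n";
```

A plain index($text, 'hack', 5) call would report offset 9 here, which is the substring hit inside "hacker" that \b filters out.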

Here is the complete sample code:

use strict;
use warnings;
use 5.010;

my $wordlist_file = 'wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";

my %pos;

my $essay_file = 'essay.txt';
my $essay = do {
    local $/ = undef;
    open my $fh, "<", $essay_file
        or die "could not open $essay_file: $!";
    <$fh>;
};

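while (<$wordlist_fh>) {
    chomp;                                # $_ now holds one word from the list
    $pos{$_} = [] unless $pos{$_};        # words with no match keep an empty list
    # Scan the whole essay for every whole-word occurrence of $_
    while ($essay =~ m/\b$_\b/g) {
        # No capture groups, so @- holds a single element: the match start
        push @{$pos{$_}}, @-;
    }
}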

use Data::Dumper;
print Dumper(\%pos);

The wordlist and essay files are similar to the ones ThisSuitIsBlackNot mentioned.

wordlist.txt

I
Perl
hacker

essay.txt

I want to be just another Perl hacker when I grow up
I want to be just another Perl hacker when I grow up

The %pos hash now contains all the positions of each word. I just displayed them with Dumper:

$VAR1 = {
          'hacker' => [
                        '31',
                        '84'
                      ],
          'Perl' => [
                      '26',
                      '79'
                    ],
          'I' => [
                   '0',
                   '43',
                   '53',
                   '96'
                 ]
        };

Note that the count includes the newline character at the end of each line.

Answer 2

Maybe you could use the index() function.

Here is the link: Using the Perl index() function

Here is my sample. Performance may not be great. Hope it helps~ :)

open (my $wordlist, "<", "files/wordlist.txt")
    or die "cannot open < wordlist.txt $!";

open (my $essay, "<", "files/essay.txt")
    or die "cannot open < essay.txt $!";

my $words = {};

while (<$wordlist>) {
    chomp($_);
    $words->{$_} = 1;
}

my $row_count = 0;
while (<$essay>) {
    $row_count++;
    chomp($_);
    foreach my $word (keys %{$words}) {
        my $offset = 0;
        my $r = index($_, $word, $offset);

        while ($r != -1) {
            print "Found [$word] in line $row_count at $r\n";
            $offset = $r + 1;
            $r = index($_, $word, $offset);
        }
    }
}

Answer 3

In your code, $essay and $wordlist are both filehandles. When you say

print "match found\n" if ($essay =~ m/$wordlist/);

you are trying to match the stringification of one filehandle against the stringification of another. When a filehandle is stringified, it looks something like this:

GLOB(0x9a26c38)

So your code is effectively doing this:

print "match found\n" if ('GLOB(0x9a26c38)' =~ m/GLOB(0x94bbc38)/);

That's not what you want. You need to read the contents of the files and compare those, not the filehandles themselves.
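A small sketch of the difference, using in-memory filehandles in place of the real files (the sample text and variable names are assumptions made for this illustration):

```perl
use strict;
use warnings;

# In-memory filehandles stand in for essay.txt and wordlist.txt (assumption).
my $essay_text    = "I want to be just another Perl hacker\n";
my $wordlist_text = "Perl\n";

open my $essay_fh, '<', \$essay_text
    or die "cannot open in-memory essay: $!";
open my $wordlist_fh, '<', \$wordlist_text
    or die "cannot open in-memory wordlist: $!";

# Stringifying a handle yields something like GLOB(0x...), not the contents.
my $handle_as_string = "$essay_fh";
print "handle: $handle_as_string\n";

# Read the contents first, then compare contents with contents.
my $essay = do { local $/; <$essay_fh> };    # slurp the whole essay
chomp(my $word = <$wordlist_fh>);            # first word of the list

# \Q...\E quotes regex metacharacters in the word (an addition of this sketch)
my $start = $essay =~ /\b\Q$word\E\b/ ? $-[0] : undef;
print defined $start ? "match found at $start\n" : "no match found\n";
```

The programs in the rest of this answer follow the same pattern of reading first and matching afterwards, applied to the real files.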

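One word per line

The following code assumes that your "essay" contains one word per line. We read the contents of the essay file into a hash of arrays, where the lines are the keys and arrays of positions are the values. We use arrays in case the same word appears more than once in the file. The position of the first word is zero. Then we iterate over the wordlist file, printing each word and the position of its first match (if any).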

use strict;
use warnings;
use 5.010;

my $essay_file = 'files/essay.txt';
open my $essay_fh, '<', $essay_file or die "Failed to open '$essay_file': $!";

my $pos = 0;
my %essay;

while (<$essay_fh>) {
    chomp;
    push @{ $essay{$_} }, $pos;
    $pos += length $_;
}

my $wordlist_file = 'files/wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";

while (<$wordlist_fh>) {
    chomp;
    say "$_: $essay{$_}[0]" if exists $essay{$_};
}

essay.txt

I
want
to
be
just
another
Perl
hacker
when
I
grow
up

wordlist.txt

I
Perl
hacker

Output

I: 0
Perl: 20
hacker: 24

Note that I ignored newlines when calculating the position values. You can adjust this as needed.
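One possible adjustment, sketched under the assumption that each line ending should count as one character. The helper name line_offsets and its $newline_len parameter are hypothetical, and an in-memory filehandle stands in for essay.txt:

```perl
use strict;
use warnings;

# Hypothetical helper: map each line (word) to the offsets where it starts,
# adding $newline_len characters per line ending
# (0 reproduces the numbers shown above).
sub line_offsets {
    my ($fh, $newline_len) = @_;
    my $pos = 0;
    my %offsets;
    while (my $line = <$fh>) {
        chomp $line;
        push @{ $offsets{$line} }, $pos;
        $pos += length($line) + $newline_len;
    }
    return \%offsets;
}

# One word per line, with a trailing newline after the last word
my $essay_text = join "\n", qw(I want to be just another Perl hacker), '';
open my $fh, '<', \$essay_text or die "cannot open in-memory essay: $!";

my $offsets = line_offsets($fh, 1);    # 1 = count "\n" as one character
print "$_: $offsets->{$_}[0]\n" for qw(I Perl hacker);
```

With newlines counted, the offsets become true character offsets into the file, provided the line ending is a single "\n".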

Multiple words per line

If your essay file can have multiple words per line, we can use a regex to check for matches:

use strict;
use warnings;
use 5.010;

# Slurp entire essay file into a variable
my $essay = do {
    local $/;
    my $essay_file = 'files/essay.txt';
    open my $essay_fh, '<', $essay_file or die "Failed to open '$essay_file': $!";
    <$essay_fh>;
};

my $wordlist_file = 'files/wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";

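while (<$wordlist_fh>) {
    chomp;
    # No /g here: a scalar-context /g match would resume from the previous
    # word's match position, so results would depend on wordlist order.
    # $-[0] is the start offset of the first whole-word match.
    say "$_: ", $-[0] if $essay =~ /\b$_\b/;
}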

essay.txt

I want to be just another Perl hacker when I grow up

wordlist.txt

I
Perl
hacker
hack

Output

I: 0
Perl: 26
hacker: 31

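Note that the results differ slightly from those of our other program, because there are now spaces between the words. Also note that there is no output for the word hack, since we only check for whole-word matches.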

Find and get the positions of a list of words in a text
