Using quotemeta() for exact string matching in Perl

Date: 2011-12-30 21:15:22

Tags: regex perl

I am trying to use quotemeta in Perl. Here is the code, which contains the string to search and the loop over the patterns I am trying to find:

open FH, "<query.txt";

@foo = <FH>;
my $bar = "A lymph node Elspar (Merck & Co. Inc) Thyrogen (Genzyme Inc) metastasis 
PEG-Intron  (Schering Corp) specimen from a human testicular embryonal carcinoma with
 elements of a choriocarcinoma Secremax, SecreFlo Secremax, SecreFlo (Repligen Corp)";

foreach my $word(@foo) {
chomp $word;
if ($bar =~ /\b\Q$word\E\b/i)
{
print "$word\n";
}
}

Say query.txt is a file containing the following terms, which I want to find in the string:

Elspar (Merck & Co. Inc)
Thyrogen (Genzyme Inc)
PEG-Intron  (Schering Corp)
Secremax, SecreFlo
Secremax, SecreFlo (Repligen Corp)

My code does not seem to work, and I don't understand what is going wrong.

Update:

If $bar = "A lymph node Elspar (Merck & Co. Inc) Thyrogen (Genzyme Inc) metastasis 
PEG-Intron  (Schering Corp) specimen from a human testicular embryonal carcinoma with
 elements of a choriocarcinoma Secremax, SecreFlo Secremax, SecreFlo (Repligen Corp)
specimen from a human testicular embryonal carcinoma with elements of a choriocarcinoma
was successfully  xenotransplanted into nude mice and maintained until the tenth animal
passage. Electron microscopy of the tumors in nude mice revealed details Secremax,
SecreFlo consistent with their epithelial origin.";

query.txt also contains the following terms:

pa
the
scopy
ealed

4 answers:

Answer 0 (score: 6)

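The problem is that \b, which you have put on both sides of your search term, matches only between a \w character and a non-\w character (or at the start or end of the string). Since ) is not a word character, and neither is a space, \)\b does not match ") ".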

The solution depends on what you are trying to do. Perhaps you want

$bar =~ /(?<!\w)\Q$word\E(?!\w)/i

which says that the match must not touch a \w character on either side.
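To make the difference concrete, here is a minimal sketch comparing the two patterns on one term. The variable names and the shortened sample text are assumptions made for illustration; they are not part of the original question.

```perl
use strict;
use warnings;

# Hypothetical sample text and term, shortened from the question's data.
my $text = "Thyrogen (Genzyme Inc) metastasis";
my $term = "Thyrogen (Genzyme Inc)";

# \b after ")" needs a \w character on one side; here ")" is followed by a space.
my $with_b = $text =~ /\b\Q$term\E\b/i ? 1 : 0;

# The lookarounds only require that no \w character touches the match.
my $with_look = $text =~ /(?<!\w)\Q$term\E(?!\w)/i ? 1 : 0;

print "with \\b: $with_b, with lookarounds: $with_look\n";
```

In this sketch the \b version fails and the lookaround version succeeds, because the lookarounds state only what must not be adjacent to the match rather than demanding a word/non-word transition.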

Response to the update:

Apart from "the", your query strings are not words. If you want to match parts of words, then you do not want \b at all. It sounds like you mean:

$bar =~ /\Q$word\E/i

That means "just find $word, and I don't care what is touching it."
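As a small sketch of what that unanchored pattern does with the extra terms from the update (the sample sentence is shortened from the question's text, and the variable names are assumptions for illustration):

```perl
use strict;
use warnings;

# Hypothetical, shortened sample text based on the updated question.
my $text  = "maintained until the tenth animal passage. Electron microscopy revealed details";
my @terms = ("pa", "the", "scopy", "ealed");

# No boundary assertions: a term is found wherever it occurs, even inside a word.
my @found = grep { $text =~ /\Q$_\E/i } @terms;

print "$_\n" for @found;
```

All four terms are reported here: "pa" is found inside "passage", "scopy" inside "microscopy", and "ealed" inside "revealed".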

Answer 1 (score: 4)

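\b matches only at a word boundary, but some of your patterns end with a parenthesis, and the position between a parenthesis and a following space is not a word boundary. Instead, use the regex /(?<!\w)\Q$word\E(?!\w)/i to ensure that your match is neither preceded nor followed by a word character.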

Answer 2 (score: 4)

I added use strict; and use warnings;, added my before @foo, and added a print statement inside the loop:

foreach my $word (@foo)
{
    chomp $word;
    print "Checking $word:\n";
    if ($bar =~ /\b\Q$word\E\b/i)
    {
        print "Match $word\n";
    }
}

I then got this output from Perl 5.12.3 on MacOS X 10.7.2 (Lion):

Checking Elspar (Merck & Co. Inc):
Checking Thyrogen (Genzyme Inc):
Checking PEG-Intron  (Schering Corp):
Checking Secremax, SecreFlo:
Match Secremax, SecreFlo
Checking Secremax, SecreFlo (Repligen Corp):

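So the pattern match works for me when $word contains no regex metacharacters. However, it is not as simple as "the \Q...\E notation is not working"; I changed the query.txt file to: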

Elspar .Merck . Co. Inc.
Thyrogen .Genzyme Inc.
PEG-Intron  .Schering Corp.
Secremax, SecreFlo
Secremax, SecreFlo .Repligen Corp.

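and got the same result as before. That makes the \b markers the suspects; some of your strings do not end at a word boundary. If I remove the \b markers from the regex, then I get: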

Checking Elspar (Merck & Co. Inc):
Match Elspar (Merck & Co. Inc)
Checking Thyrogen (Genzyme Inc):
Match Thyrogen (Genzyme Inc)
Checking PEG-Intron  (Schering Corp):
Match PEG-Intron  (Schering Corp)
Checking Secremax, SecreFlo:
Match Secremax, SecreFlo
Checking Secremax, SecreFlo (Repligen Corp):
Match Secremax, SecreFlo (Repligen Corp)

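You can keep the first \b; that gives the same result. The close parentheses cause the problem because, when one is followed by a space (as in the text), the position between them is not a boundary between a word character and a non-word character.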
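The following sketch illustrates that point: the trailing \b after ")" fails when a space follows, succeeds when a letter follows, and dropping only the trailing \b is enough for the text in the question. The two input strings are hypothetical variants made up for this illustration.

```perl
use strict;
use warnings;

my $term = "Thyrogen (Genzyme Inc)";

# Hypothetical inputs: the same term followed by a space, then by a letter.
my $followed_by_space  = "Thyrogen (Genzyme Inc) metastasis";
my $followed_by_letter = "Thyrogen (Genzyme Inc)metastasis";

# ")" then " ": both are non-word characters, so there is no boundary.
my $space_hit = $followed_by_space =~ /\b\Q$term\E\b/i ? 1 : 0;

# ")" then "m": non-word next to word, so \b matches here.
my $letter_hit = $followed_by_letter =~ /\b\Q$term\E\b/i ? 1 : 0;

# Keeping only the leading \b works, since the term starts with a word character.
my $leading_only = $followed_by_space =~ /\b\Q$term\E/i ? 1 : 0;

print "space: $space_hit, letter: $letter_hit, leading only: $leading_only\n";
```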


Answer to the amended question

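This code seems to work as required. Basically, it looks at each query term to decide how to frame the pattern: \b is added at the start or end only when the term itself starts or ends with a word character.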

use strict;
use warnings;

open my $fh, '<', 'query.txt' or die "Cannot open query.txt: $!";

my @foo = <$fh>;
close $fh;
#my $bar = "A lymph node Elspar (Merck & Co. Inc) Thyrogen (Genzyme Inc) metastasis PEG-Intron  (Schering Corp) specimen from a human testicular embryonal carcinoma with elements of a choriocarcinoma Secremax, SecreFlo Secremax, SecreFlo (Repligen Corp)";

my $bar =  "A lymph node Elspar (Merck & Co. Inc) Thyrogen (Genzyme Inc) metastasis PEG-Intron  (Schering Corp) specimen from a human testicular embryonal carcinoma with elements of a choriocarcinoma Secremax, SecreFlo Secremax, SecreFlo (Repligen Corp) specimen from a human testicular embryonal carcinoma with elements of a choriocarcinoma was successfully  xenotransplanted into nude mice and maintained until the tenth animal passage. Electron microscopy of the tumors in nude mice revealed details Secremax, SecreFlo consistent with their epithelial origin.";

foreach my $word (@foo)
{
    chomp $word;
    print "Checking $word:\n";
    my ($pfx, $sfx) = ('', '');
    $pfx = '\b' if ($word =~ /^\w/);
    $sfx = '\b' if ($word =~ /\w$/);
    if ($bar =~ /$pfx\Q$word\E$sfx/i)
    {
        print "Match $word\n";
    }
}

Sample output:

Checking Elspar (Merck & Co. Inc):
Match Elspar (Merck & Co. Inc)
Checking Thyrogen (Genzyme Inc):
Match Thyrogen (Genzyme Inc)
Checking PEG-Intron  (Schering Corp):
Match PEG-Intron  (Schering Corp)
Checking Secremax, SecreFlo:
Match Secremax, SecreFlo
Checking Secremax, SecreFlo (Repligen Corp):
Match Secremax, SecreFlo (Repligen Corp)
Checking pa:
Checking the:
Match the
Checking scopy:
Checking ealed:

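That looks correct to me. Whether it works for all possible scenarios is open to discussion. You might need to worry about whether (Secremax, Secreflow (Repligen Corp)) should match 'Repligen' in the pattern; if it should not, you will have to give a more rigorous definition of what constitutes a match.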
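To illustrate that caveat, here is a sketch that wraps the conditional-boundary logic from the loop in a helper and tries it on a bare 'Repligen'. The name term_matches and the shortened sample text are assumptions made for this example, not part of the answer's code.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the conditional-boundary logic shown in the loop.
sub term_matches {
    my ($text, $term) = @_;
    # Add \b only on a side where the term itself has a word character.
    my $pfx = $term =~ /^\w/ ? '\b' : '';
    my $sfx = $term =~ /\w$/ ? '\b' : '';
    return $text =~ /$pfx\Q$term\E$sfx/i ? 1 : 0;
}

# Hypothetical, shortened sample text based on the question.
my $text = "Secremax, SecreFlo (Repligen Corp) until the tenth animal passage.";

my @checks = map { term_matches($text, $_) }
    ("Repligen", "pa", "Secremax, SecreFlo (Repligen Corp)");

print "@checks\n";
```

With this text the bare term "Repligen" matches inside the parenthesised company name, while "pa" does not match inside "passage". Whether the first result is wanted is exactly the kind of question a stricter definition of a match would have to settle.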

Answer 3 (score: 1)

Use quotemeta:

open FH, "<query.txt";

@foo = <FH>;
my $bar = "A lymph node Elspar (Merck & Co. Inc) Thyrogen (Genzyme Inc) metastasis 
PEG-Intron  (Schering Corp) specimen from a human testicular embryonal carcinoma with
 elements of a choriocarcinoma Secremax, SecreFlo Secremax, SecreFlo (Repligen Corp)";

foreach my $word(@foo) {
    chomp $word;

    my $quoted_word = quotemeta($word);

    if ($bar =~ m/$quoted_word/i){
        print "$word\n";
    }
}
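For reference, a minimal sketch of what quotemeta produces and how it relates to \Q...\E. The sample term and text come from the question; the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $term = "Elspar (Merck & Co. Inc)";
my $text = "A lymph node Elspar (Merck & Co. Inc) Thyrogen (Genzyme Inc) metastasis";

# quotemeta backslash-escapes every ASCII character that is not a word character.
my $quoted = quotemeta($term);
print "$quoted\n";

# \Q...\E in a double-quoted string applies the same escaping.
my $same = "\Q$term\E" eq $quoted ? 1 : 0;

# Unquoted, the parentheses form a group and "." matches any character,
# so the literal "(" in the text is never matched.
my $raw_hit    = $text =~ /$term/i   ? 1 : 0;
my $quoted_hit = $text =~ /$quoted/i ? 1 : 0;

print "same: $same, raw: $raw_hit, quoted: $quoted_hit\n";
```

Note that this pattern has no boundary checks at all, so like the plain \Q$word\E pattern it also matches terms that occur inside longer words.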