匹配不在引用之间的表达式

时间:2014-04-07 10:11:35

标签: regex perl

我正在尝试从字符串中提取表达式。但我需要这个表达式不要在引号之间。

例如,我需要能够从这种字符串中提取returnifeq(perl关键字):return 3 + 4 if $string eq "time is eq";

我不会检索time,即使它是关键字,因为它在引号之间。

到目前为止,这是我尝试过的:

  • m/((("(\\"|[^"])*")|[^"])*)(^|\s|<)+${keywords}(\s|\(|>|$|\n)/g

  • 我还试图检查字符串的结尾,看看我是否有偶数/奇数引号。

第一个解决方案的问题是,如果有空格,那么它与关键字不匹配,然后是关键字。 第二个解决方案的问题是,如果字符串中有第二个关键字,并且第二个关键字后面的引号之间有一个字符串,则最新的匹配项将匹配。

我想知道如何解决这个问题我已经做了一段时间了,这让我发疯了! :)

谢谢!

PS:有人问我这里的关键字列表是:

my $keywords = qr/(-A|END|length|setpgrp|-B|endgrent|link|setpriority|-b|endhostent|listen|setprotoent|-C|endnetent|local|setpwent|-c|endprotoent|localtime|setservent|-d|endpwent|log|setsockopt|-e|endservent|lstat|shift|-f|eof|map|shmctl|-g|eval|mkdir|shmget|-k|exec|msgctl|shmread|-l|exists|msgget|shmwrite|-M|exit|msgrcv|shutdown|-O|fcntl|msgsnd|sin|-o|fileno|my|sleep|-p|flock|next|socket|-r|fork|not|socketpair|-R|format|oct|sort|-S|formline|open|splice|-s|getc|opendir|split|-T|getgrent|ord|sprintf|-t|getgrgid|our|sqrt|-u|getgrnam|pack|srand|-w|gethostbyaddr|pipe|stat|-W|gethostbyname|pop|state|-X|gethostent|pos|study|-x|getlogin|print|substr|-z|getnetbyaddr|printf|symlink|abs|getnetbyname|prototype|syscall|accept|getnetent|push|sysopen|alarm|getpeername|quotemeta|sysread|atan2|getpgrp|rand|sysseek|AUTOLOAD|getppid|read|system|BEGIN|getpriority|readdir|syswrite|bind|getprotobyname|readline|tell|binmode|getprotobynumber|readlink|telldir|bless|getprotoent|readpipe|tie|break|getpwent|recv|tied|caller|getpwnam|redo|time|chdir|getpwuid|ref|times|CHECK|getservbyname|rename|truncate|chmod|getservbyport|require|uc|chomp|getservent|reset|ucfirst|chop|getsockname|return|umask|chown|getsockopt|reverse|undef|chr|glob|rewinddir|UNITCHECK|chroot|gmtime|rindex|unlink|close|goto|rmdir|unpack|closedir|grep|say|unshift|connect|hex|scalar|untie|cos|index|seek|use|crypt|INIT|seekdir|utime|dbmclose|int|select|values|dbmopen|ioctl|semctl|vec|defined|join|semget|wait|delete|keys|semop|waitpid|DESTROY|kill|send|wantarray|die|last|setgrent|warn|dump|lc|sethostent|write|each|lcfirst|setnetent|__DATA__|else|lock|qw|__END__|elsif|lt|qx|__FILE__|eq|m|s|__LINE__|exp|ne|sub|__PACKAGE__|for|no|tr|and|foreach|or|unless|cmp|ge|package|until|continue|gt|q|while|CORE|if|qq|xor|do|le|qr|y|ARGV|STDERR|STDOUT|ARGVOUT|STDIN)/;

2 个答案:

答案 0 :(得分:1)

一个想法:

搜索并替换为空白:

".*?"|\$\w+

这将删除所有变量字符串,即以$开头,所有单词都在双引号内。

在此之后,您可以使用正常的正则表达式查找要匹配的所有单词:

/(-A|END|length|setpgrp|-B|endgrent|link|setpriority|-b|endhostent|listen|setprotoent|-C|endnetent|local|setpwent|-c|endprotoent|localtime|setservent|-d|endpwent|log|setsockopt|-e|endservent|lstat|shift|-f|eof|map|shmctl|-g|eval|mkdir|shmget|-k|exec|msgctl|shmread|-l|exists|msgget|shmwrite|-M|exit|msgrcv|shutdown|-O|fcntl|msgsnd|sin|-o|fileno|my|sleep|-p|flock|next|socket|-r|fork|not|socketpair|-R|format|oct|sort|-S|formline|open|splice|-s|getc|opendir|split|-T|getgrent|ord|sprintf|-t|getgrgid|our|sqrt|-u|getgrnam|pack|srand|-w|gethostbyaddr|pipe|stat|-W|gethostbyname|pop|state|-X|gethostent|pos|study|-x|getlogin|print|substr|-z|getnetbyaddr|printf|symlink|abs|getnetbyname|prototype|syscall|accept|getnetent|push|sysopen|alarm|getpeername|quotemeta|sysread|atan2|getpgrp|rand|sysseek|AUTOLOAD|getppid|read|system|BEGIN|getpriority|readdir|syswrite|bind|getprotobyname|readline|tell|binmode|getprotobynumber|readlink|telldir|bless|getprotoent|readpipe|tie|break|getpwent|recv|tied|caller|getpwnam|redo|time|chdir|getpwuid|ref|times|CHECK|getservbyname|rename|truncate|chmod|getservbyport|require|uc|chomp|getservent|reset|ucfirst|chop|getsockname|return|umask|chown|getsockopt|reverse|undef|chr|glob|rewinddir|UNITCHECK|chroot|gmtime|rindex|unlink|close|goto|rmdir|unpack|closedir|grep|say|unshift|connect|hex|scalar|untie|cos|index|seek|use|crypt|INIT|seekdir|utime|dbmclose|int|select|values|dbmopen|ioctl|semctl|vec|defined|join|semget|wait|delete|keys|semop|waitpid|DESTROY|kill|send|wantarray|die|last|setgrent|warn|dump|lc|sethostent|write|each|lcfirst|setnetent|__DATA__|else|lock|qw|__END__|elsif|lt|qx|__FILE__|eq|m|s|__LINE__|exp|ne|sub|__PACKAGE__|for|no|tr|and|foreach|or|unless|cmp|ge|package|until|continue|gt|q|while|CORE|if|qq|xor|do|le|qr|y|ARGV|STDERR|STDOUT|ARGVOUT|STDIN)/g

答案 1 :(得分:0)

如果您想以特殊方式处理单引号和双引号,答案是尝试匹配它们,就像您匹配关键字一样。然后,您只需再执行一步,根据捕获的内容过滤所有匹配项。

另外两个提示

  • 不要在缓存的正则表达式(...)中包含捕获组qr{},该正则表达式将在更大的正则表达式中进行插值。而是使用非捕获组(?:...)。这是因为您希望能够在执行测试时准确查看正在捕获的内容,而无需向上滚动以确定所包含的正则表达式的确切编码。
  • 在尝试匹配特定字词时,请勿忘记字词边界\b

以下是我如何重新编写您的尝试:

use strict;
use warnings;

my $singlequote_re = qr{'(?:(?>[^']+)|\\.)*'};
my $doublequote_re = qr{"(?:(?>[^"]+)|\\.)*"};
my $keywords_re = qr/(?:-A|END|length|setpgrp|-B|endgrent|link|setpriority|-b|endhostent|listen|setprotoent|-C|endnetent|local|setpwent|-c|endprotoent|localtime|setservent|-d|endpwent|log|setsockopt|-e|endservent|lstat|shift|-f|eof|map|shmctl|-g|eval|mkdir|shmget|-k|exec|msgctl|shmread|-l|exists|msgget|shmwrite|-M|exit|msgrcv|shutdown|-O|fcntl|msgsnd|sin|-o|fileno|my|sleep|-p|flock|next|socket|-r|fork|not|socketpair|-R|format|oct|sort|-S|formline|open|splice|-s|getc|opendir|split|-T|getgrent|ord|sprintf|-t|getgrgid|our|sqrt|-u|getgrnam|pack|srand|-w|gethostbyaddr|pipe|stat|-W|gethostbyname|pop|state|-X|gethostent|pos|study|-x|getlogin|print|substr|-z|getnetbyaddr|printf|symlink|abs|getnetbyname|prototype|syscall|accept|getnetent|push|sysopen|alarm|getpeername|quotemeta|sysread|atan2|getpgrp|rand|sysseek|AUTOLOAD|getppid|read|system|BEGIN|getpriority|readdir|syswrite|bind|getprotobyname|readline|tell|binmode|getprotobynumber|readlink|telldir|bless|getprotoent|readpipe|tie|break|getpwent|recv|tied|caller|getpwnam|redo|time|chdir|getpwuid|ref|times|CHECK|getservbyname|rename|truncate|chmod|getservbyport|require|uc|chomp|getservent|reset|ucfirst|chop|getsockname|return|umask|chown|getsockopt|reverse|undef|chr|glob|rewinddir|UNITCHECK|chroot|gmtime|rindex|unlink|close|goto|rmdir|unpack|closedir|grep|say|unshift|connect|hex|scalar|untie|cos|index|seek|use|crypt|INIT|seekdir|utime|dbmclose|int|select|values|dbmopen|ioctl|semctl|vec|defined|join|semget|wait|delete|keys|semop|waitpid|DESTROY|kill|send|wantarray|die|last|setgrent|warn|dump|lc|sethostent|write|each|lcfirst|setnetent|__DATA__|else|lock|qw|__END__|elsif|lt|qx|__FILE__|eq|m|s|__LINE__|exp|ne|sub|__PACKAGE__|for|no|tr|and|foreach|or|unless|cmp|ge|package|until|continue|gt|q|while|CORE|if|qq|xor|do|le|qr|y|ARGV|STDERR|STDOUT|ARGVOUT|STDIN)/;

while (<DATA>) {
    print "Line = $_";
    while (m{($singlequote_re)|($doublequote_re)|(\b$keywords_re\b)}g) {
        my $singlequote = $1;
        my $doublequote = $2;
        my $keyword = $3;

        if (defined $singlequote) {
            print "  Single quote = <$singlequote>\n";

        } elsif (defined $doublequote) {
            print "  Double quote = <$doublequote>\n";

        } elsif (defined $keyword) {
            print "  Keyword = <$keyword>\n";
        }
    }
}

__DATA__
return 3 + 4 if $string eq "time is eq";

输出:

Line = return 3 + 4 if $string eq "time is eq";
  Keyword = <return>
  Keyword = <if>
  Keyword = <eq>
  Double quote = <"time is eq">

请注意,如果这只是一个练习,那么适当的建议就是使用PPI,因为尝试用正则表达式解析perl是一种愚蠢的差事。总是存在边缘情况,因此使用实际词法分析器只是真正解决这个问题的方法。