How do I write the following regular expression in Flex?

Asked: 2014-11-10 15:56:12

Tags: c++ regex flex-lexer

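I'm trying to define a rule in flex that captures a "multiline string". A multiline string starts with three apostrophes ('''), ends with three apostrophes, and may span several lines.
For example: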

'''This is
an example of
a multiline
string'''

So my attempt was this:

%{
#include <iostream>
using std::cout;
using std::endl;
%}

%option noyywrap

MULTI_LN_STR    '''(.|\n)*'''

%%

{MULTI_LN_STR}  {cout<<"GotIt!";}   

%%

int main(int argc, char* argv[]) {

    yyin=fopen("test.txt", "r");

    if (!yyin) {
        cout<<"yyin is NULL"<<endl;
        return 1;
    }

    yylex();
    return 0;
}

It works for this input:

'''This is
a multi
line
string!'''

This is
some random
text

The output is:

GotIt!

This is
some random
text

But it doesn't work (or, more precisely, it produces the wrong output) for this input:

'''This is
a multi
line
string!'''

This is
some random
text

'''and this
is another
multiline
string''' 

which produces:

GotIt!

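This is because my rule says:
"scan three apostrophes, then any possible characters, then three apostrophes",
whereas it should say:
"scan three apostrophes, then any possible characters except three apostrophes, then three apostrophes".
Since flex always takes the longest match, (.|\n)* swallows everything up to the last ''' in the input.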

How can I do that?

2 Answers:

Answer 0: (score: 2)

For a simple negation like this, it is relatively easy to construct the regular expression:

"'''"([^']|'[^']|''[^'])*"'''"

Answer 1: (score: -2)

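The range quantifier construct {x,y} appears to be supported,
so this works, and in the Perl benchmark below it is faster than the alternation. If you have large strings, this is the way to go.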

'''[^']*(?:[']{1,2}[^']+)*'''

 '''
 [^']* 
 (?: [']{1,2} [^']+ )*
 '''

Benchmark: alternation vs. non-alternation

-----------------------------
'''Set 1 - this
is another
multiline
string'''
 Regex_FAST  (?-xism:'''[^']*(?:[']{1,2}[^']+)*''')
    -took: 0.811201 wallclock secs ( 0.81 usr +  0.00 sys =  0.81 CPU)

'''Set 1 - this
is another
multiline
string'''
 Regex_ALT  (?-xism:'''(?:[^']|'[^']|''[^'])*''')
    -took: 1.4971 wallclock secs ( 1.50 usr +  0.00 sys =  1.50 CPU)

-----------------------------
'''Set 2 - this
is' another
mul'tiline
st''ring'''
 Regex_FAST  (?-xism:'''[^']*(?:[']{1,2}[^']+)*''')
    -took: 0.935462 wallclock secs ( 0.94 usr +  0.00 sys =  0.94 CPU)

'''Set 2 - this
is' another
mul'tiline
st''ring'''
 Regex_ALT  (?-xism:'''(?:[^']|'[^']|''[^'])*''')
    -took: 1.85556 wallclock secs ( 1.86 usr +  0.00 sys =  1.86 CPU)

Benchmark code:

use strict;
use warnings;
use Benchmark ':hireswallclock';

my ($t0,$t1);
my @dataset = (
   "'''Set 1 - this\nis another\nmultiline\nstring'''",
   "'''Set 2 - this\nis' another\nmul'tiline\nst''ring'''" ); 

my $regex_FAST = qr/'''[^']*(?:[']{1,2}[^']+)*'''/;
my $regex_ALT  = qr/'''(?:[^']|'[^']|''[^'])*'''/;

for my $data (@dataset)
{
    print "-----------------------------\n";

  ## 
    while ($data =~ /$regex_FAST/g){ print "$&\n"; };
    $t0 = new Benchmark;
    for my $cnt (1 .. 500_000) {
        while ($data =~ /$regex_FAST/g){ };
    }
    $t1 = new Benchmark;
    print " Regex_FAST  $regex_FAST\n    -took: ", timestr(timediff($t1, $t0)), "\n\n";

  ## 
    while ($data =~ /$regex_ALT/g){ print "$&\n"; };
    $t0 = new Benchmark;
    for my $cnt (1 .. 500_000) {
        while ($data =~ /$regex_ALT/g){ };
    }
    $t1 = new Benchmark;
    print " Regex_ALT  $regex_ALT\n    -took: ", timestr(timediff($t1, $t0)), "\n\n";
}