Question

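As a Flex-Bison beginner I have already hit my first roadblock, and I can't seem to find a way around it.

Problem statement: given an html/xml file, I need to extract the data between tags. I have read the related questions on SO, but none of them seems to address this particular problem.

(Since the point of this is to learn flex-bison, I don't want to switch to any other language or tool.)

The input file contains the following field to be extracted: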

<!DOCTYPE html>
<html charset="utf-8" lang="en">
<head>
<meta content="text/html; charset=UTF-8" http-equiv="content-type">
<meta content="text/css" http-equiv="Content-Style-Type">
<script src="/commd/jquery.nivo.slider.pack.js"></script>
<link rel="stylesheet" type="text/css" href="/fonts/stylesheet.css"/>
<link rel="stylesheet" type="text/css" href="/commd/stylesheet.css"/>


<!--<legend> DATA TO BE EXTRACTED</legend>--> //relevant data between <legend> tag

I wrote the following scanner, test.l:

%option noyywrap
%{
#include "parser.tab.h"
%}
%%
"<!--<legend>"  {return name1;}
(.*?)   {yylval.sval=strdup(yytext); return name2;}
"<\/legend>" {return name3;}
%%

and the parser code, parser.y:

%{
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
#define YYERROR_VERBOSE
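extern int yylex();
extern int yyparse();
extern FILE *yyin;
void yyerror(char *s);  /* declared up front: the generated parser calls it before its definition below */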

%}

%union {
    char *sval;
}

%token <sval> name1
%token <sval> name2 
%token <sval> name3

%%
names : name1 name2 name3 { printf("%s\n", $2); }

%%

int main(int argc, char **argv) {

    // open a file handle to a particular file:
    FILE *myfile = fopen(argv[1], "r");
    // make sure it is valid:
    if (!myfile) {
        printf("I can't open file!");
        return 1;
    }
    // set flex to read from it instead of defaulting to STDIN:
    yyin = myfile;

    // parse through the input until there is no more:
    do {
        yyparse();
    } while (!feof(yyin));

}

void yyerror(char *s) {
    printf("EEK, parse error!  Message:%s",s);
    // might as well halt now:
    exit(1);
}

compiled with this makefile:

all: compile_run

compile_run:
    @bison -d parser.y
    @flex test.l
    @gcc parser.tab.c lex.yy.c -lfl -o run

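When I run the program, I get the following error:

EEK, parse error!  Message:syntax error, unexpected name2, expecting name1

I understand the error as reported: the name2 token can match without limit, and it shows up before the name1 token that the grammar expects.

My question is: given that my grammar looks for name1 first, then name2, then name3, why does this error occur?

If I define only a single token, name1, in the scanner: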

<!--<legend>(.*?)<\/legend> {return name1;}

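I get the whole string, tags included. I could post-process it to get the data, but I really think there must be a smarter way, which I hope to learn here :).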

Answer 1

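The reason you are having trouble is that you have only defined rules for part of the input file and expect the lexer and parser to simply ignore the rest. That is not how these tools work; they try to match everything, so you have to define rules for every aspect of the input data.

I also noticed that your original lexer file does not build cleanly with flex. Your rules are in the wrong order. Your original rule set: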

%option noyywrap
%{
#include "parser.tab.h"
%}
%%
"<!--<legend>"  {return name1;}
(.*?)   {yylval.sval=strdup(yytext); return name2;}
"<\/legend>" {return name3;}
%%

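gives the following warning:

"test.l", line 8: warning, rule cannot be matched

This is because flex picks the rule with the longest match and, when two rules match the same length, the one listed first. Anything that "<\/legend>" matches is also matched by (.*?) at the same length, and (.*?) comes first, so name3 can never be returned. (Flex has no non-greedy operators, so the ? here only makes the .* optional.) You evidently fixed this in order to build your test program. The fix is to reverse the order of the rules, like this: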

%option noyywrap
%{
#include "parser.tab.h"
%}
%%
"<!--<legend>"  {return name1;}
"<\/legend>" {return name3;}
(.*?)   {yylval.sval=strdup(yytext); return name2;}
%%

A feature of flex (and bison) that is useful when debugging is, unsurprisingly, debug mode!

If we build your code with debug mode enabled, like this:

bison -d parser.y
flex -d test.l
gcc  parser.tab.c lex.yy.c -lfl -o run

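and then run the program, we now get useful output from the lexer:

--(end of buffer or a NUL)
--accepting rule at line 8 ("")
EEK, parse error!  Message:syntax error, unexpected name2, expecting name1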

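You can see that your rule (.*?) does indeed match any text, but not only inside <legend>; it matches everywhere else too. This means your parser sees a run of name2 tokens before it ever sees a name1. Now, the only rule in your parser requires the input to start with a name1 token, so you get a syntax error.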

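Now, there are several ways to fix this. You could change your bison rule to accept any number of name2 tokens before name1, or you could upgrade the whole grammar to describe all of XML/HTML. At the very least you probably want to upgrade the grammar to accept several <legend> tags in one file! At the moment your grammar only matches a file that contains a single <legend> construct and nothing else. Remember that it does not just ignore other input (unless you tell it to)!

Rewriting the grammar for generalised XML structure would be a bigger job, but what you can do is instruct the flex lexer to ignore the other input, so that it is not returned as name2. We just need to write patterns for the other things in the input file. We need to match other XML tags, comment lines and whitespace, and tell flex to ignore them.

An example of doing this might be: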

%{
#include "parser.tab.h"
%}
%%
"<!--<legend>"           {return name1;}
"<\/legend>"             {return name3;}
"<".[^-](.|[ \t])*">"    ; /* Skip other tags */
"//".*[\r\n]+            ; /* Skip comments */
[\r\n\t ]+               ; /* Skip unused whitespace */
(.*?)                    {yylval.sval=strdup(yytext); return name2;}
%%

When we run this code, we manage to skip some of the unwanted tags:

--(end of buffer or a NUL)
--accepting rule at line 7 ("<!DOCTYPE html>")
--accepting rule at line 9 ("
")
--accepting rule at line 7 ("<html charset="utf-8" lang="en">")
--accepting rule at line 9 ("
")
--accepting rule at line 7 ("<head>")
--accepting rule at line 9 ("
")
--accepting rule at line 7 ("<meta content="text/html; charset=UTF-8" http-equiv
="content-type">")
--accepting rule at line 9 ("
")
--accepting rule at line 7 ("<meta content="text/css" http-equiv="Content-Style-
Type">")
--accepting rule at line 9 ("
")
--accepting rule at line 7 ("<script src="/commd/jquery.nivo.slider.pack.js"></s
cript>")
--accepting rule at line 9 ("
")
--accepting rule at line 7 ("<link rel="stylesheet" type="text/css" href="/fonts
/stylesheet.css"/>")
--accepting rule at line 9 ("
")
--accepting rule at line 7 ("<link rel="stylesheet" type="text/css" href="/commd
/stylesheet.css"/>")
--accepting rule at line 9 ("


")
--accepting rule at line 10 ("<!--<legend> DATA TO BE EXTRACTED</legend>--> //re
levant data between <legend> tag")
EEK, parse error!  Message:syntax error, unexpected name2, expecting name1

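We have hit another problem. The mystery is, why did it not match and return name1? This is because the matching algorithm is greedy and picks the rule that matches the longest token: (.*?) matches the whole line, which is longer than "<!--<legend>". To solve this we have to use flex's start condition feature and only match general text inside the <legend> construct. We have to be careful when using start conditions while matching XML, because in a flex pattern the < symbol introduces a start condition as well as opening an XML tag. We can recode it to switch states like this: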

%{
#include "parser.tab.h"
%}
%x legends 
%x finishd
%%
<INITIAL>"<!--<legend>"         {BEGIN(legends); return name1;}
<finishd>"</legend>-->"         {BEGIN(INITIAL); return name3;}
<INITIAL>"<".[^-](.|[ \t])*">"  ; /* Skip other tags */
<INITIAL>"//".*[\r\n]+          ; /* Skip comments */
<INITIAL>[\r\n\t ]+             ; /* Skip unused whitespace */
<legends>[^<>]+                 {BEGIN(finishd); yylval.sval=strdup(yytext); return name2;}
%%

And then, as if by magic, we get the following:

--(end of buffer or a NUL)
--accepting rule at line 9 ("<!DOCTYPE html>")
--accepting rule at line 11 ("
")
--accepting rule at line 9 ("<html charset="utf-8" lang="en">")
--accepting rule at line 11 ("
")
--accepting rule at line 9 ("<head>")
--accepting rule at line 11 ("
")
--accepting rule at line 9 ("<meta content="text/html; charset=UTF-8" http-equiv
="content-type">")
--accepting rule at line 11 ("
")
--accepting rule at line 9 ("<meta content="text/css" http-equiv="Content-Style-
Type">")
--accepting rule at line 11 ("
")
--accepting rule at line 9 ("<script src="/commd/jquery.nivo.slider.pack.js"></s
cript>")
--accepting rule at line 11 ("
")
--accepting rule at line 9 ("<link rel="stylesheet" type="text/css" href="/fonts
/stylesheet.css"/>")
--accepting rule at line 11 ("
")
--accepting rule at line 9 ("<link rel="stylesheet" type="text/css" href="/commd
/stylesheet.css"/>")
--accepting rule at line 11 ("


")
--accepting rule at line 7 ("<!--<legend>")
--accepting rule at line 12 (" DATA TO BE EXTRACTED")
--accepting rule at line 8 ("</legend>-->")
 DATA TO BE EXTRACTED
--accepting rule at line 11 (" ")
--(end of buffer or a NUL)
--accepting rule at line 10 ("//relevant data between <legend> tag
")
--(end of buffer or a NUL)
--EOF (start condition 0)

Now, if you turn off flex debugging, you get just the desired output:

 DATA TO BE EXTRACTED

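You will still need to update the bison grammar if you want to extract more than one set of data; really you should upgrade the whole bison grammar to match more of XML properly. At least I have explained, tutorial fashion, what is going on and one way to make it work with the sample data set.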

Extracting data between tags using Flex-Bison
