Tool to parse into an AST using a grammar (or .y + .lang => xml)

Asked: 2011-03-28 23:40:05

Tags: parsing compiler-construction bison

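Given a lexer definition file, a grammar file (say, the PostgreSQL .y and .l files for bison and flex from its source tree), and a file written in the language those lexer and parser definitions describe (e.g., an SQL query), I want to get the AST in some standard form (e.g., XML or JSON).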

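The most important aspect of this tool is flexibility in its input formats. In my case, I could recreate the Postgres SQL grammar in ANTLR, but I don't want to do that. I would rather just use whatever Postgres itself is using. So even if the .y files contain more than just parsing rules, the tool I'm looking for should be able to understand them with minor modifications.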
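To make "more than just parsing rules" concrete: a bison .y file interleaves C action blocks in braces with the productions, and wraps the rules section between two `%%` markers. The following is a minimal Python sketch, under stated assumptions, of how the bare productions could be pulled out. The function names (`rules_section`, `strip_actions`, `productions`) and the `SAMPLE` grammar are hypothetical, not taken from Postgres, and the scanner is deliberately naive: it would be confused by an apostrophe inside a C comment in an action, and it keeps modifiers such as `%prec` as ordinary symbols.

```python
import re

# Quoted literals, identifiers (optionally %-prefixed), and rule punctuation.
TOKEN = re.compile(r"""'(?:\\.|[^'\\])*'|"(?:\\.|[^"\\])*"|%?[A-Za-z_][\w.]*|[:;|]""")


def rules_section(text):
    """Return the text between the first and second %% markers of a .y file."""
    parts = re.split(r"^%%[ \t]*$", text, flags=re.MULTILINE)
    return parts[1] if len(parts) > 1 else ""


def strip_actions(rules):
    """Drop every { ... } action block, keeping quoted tokens such as '{' intact."""
    out, depth, i, n = [], 0, 0, len(rules)
    while i < n:
        ch = rules[i]
        if ch in "'\"":
            # Treat a quoted literal as one unit so braces inside it are ignored.
            j = i + 1
            while j < n and rules[j] != ch:
                j += 2 if rules[j] == "\\" else 1
            j = min(j + 1, n)
            if depth == 0:
                out.append(rules[i:j])
            i = j
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)
            if depth == 0:
                out.append(" ")  # keep neighbouring symbols separated
        elif depth == 0:
            out.append(ch)
        i += 1
    return "".join(out)


def productions(rules):
    """Map each nonterminal to its list of alternatives (lists of symbols)."""
    text = re.sub(r"/\*.*?\*/", " ", rules, flags=re.DOTALL)  # drop C comments
    tokens = TOKEN.findall(strip_actions(text))
    grammar, head, alt = {}, None, []
    for i, tok in enumerate(tokens):
        if i + 1 < len(tokens) and tokens[i + 1] == ":":
            if head is not None:          # previous rule had no closing ';'
                grammar[head].append(alt)
            head, alt = tok, []
            grammar.setdefault(head, [])
        elif tok == ":":
            continue
        elif tok in ("|", ";"):
            if head is not None:
                grammar[head].append(alt)
            alt = []
            if tok == ";":
                head = None
        elif head is not None:
            alt.append(tok)
    if head is not None:                  # flush a final rule without ';'
        grammar[head].append(alt)
    return grammar


# Hypothetical miniature grammar, only loosely shaped like an SQL fragment.
SAMPLE = r"""
%token SELECT FROM NUMBER
%%
query : SELECT args                     { $$ = mk_select($2, NULL); }
      | SELECT args FROM '(' query ')'  { $$ = mk_select($2, $5); }
      ;
args  : '*'      { $$ = mk_star(); }
      | NUMBER   { if ($1 < 0) { yyerror("neg"); } $$ = mk_num($1); }
      ;
%%
int main(void) { return yyparse(); }
"""

for name, alts in productions(rules_section(SAMPLE)).items():
    print(name, "->", " | ".join(" ".join(a) for a in alts))
```

This only recovers the context-free skeleton. Anything the actions do to build the tree, and anything the lexer does with start conditions or C helper code, is lost, which is part of why "minor modifications" may be optimistic for a grammar the size of the Postgres one.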

Is there a general-purpose tool that can do this?

Here is a command-line session with ly2xml, the tool I have in mind:

$ git clone git://postgres-git-url pg
$ find pg -iname '*.[yl]' -exec cp '{}' ~/ \;
$ echo 'SELECT * FROM (SELECT 1)'|ly2xml -parser=*.y -lexer=*.l - -O-
<SELECT>
  <ARGS>*</ARGS>
  <FROM>
    <SELECT><ARGS>1</ARGS></SELECT>
  </FROM>
</SELECT>

(Note that - means read from standard input, and -O- means write to standard output.)

1 answer:

Answer 0 (score: 3)

Nice idea. You are assuming one or more of the following:

 a) that each tool that has a grammar, uses a canonical parsing engine type (e.g., everybody uses bison)
 b) that there is some parsing tool that understands the zillion grammar specification schemes that exist
 c) that whatever the parser is, it will parse language fragments (perhaps well formed).

a) is clearly false. I have never seen b). Practically no parsing engine does c); they can only parse "complete programs".
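To illustrate what c) asks for: a bison-generated parser has a single start symbol, so it accepts whole inputs derived from that symbol and nothing else. Parsing a fragment means being able to start at any nonterminal. The following is a minimal hand-written Python sketch of that idea for a toy algebra grammar; the grammar, the `Parser` class and the `parse(text, start=...)` helper are assumptions made up for illustration, not how bison, ANTLR or DMS work internally.

```python
import re

# Toy grammar (an assumption for illustration):
#   formula : product (('+' | '-') product)*
#   product : term ('*' term)*
#   term    : NUMBER | VARIABLE | '(' formula ')'
NONTERMINALS = ("formula", "product", "term")


def tokenize(text):
    """Split the input into numbers, identifiers and single-character operators."""
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return tok

    def formula(self):
        node = self.product()
        while self.peek() in ("+", "-"):
            node = (self.take(), node, self.product())
        return node

    def product(self):
        node = self.term()
        while self.peek() == "*":
            node = (self.take(), node, self.term())
        return node

    def term(self):
        tok = self.take()
        if tok == "(":
            node = self.formula()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok in ("+", "-", "*", ")"):
            raise SyntaxError("unexpected token " + repr(tok))
        return tok  # NUMBER or VARIABLE leaf


def parse(text, start="formula"):
    """Parse text as an instance of the chosen nonterminal, consuming all input."""
    if start not in NONTERMINALS:
        raise ValueError("unknown nonterminal " + repr(start))
    parser = Parser(tokenize(text))
    tree = getattr(parser, start)()
    if parser.peek() is not None:
        raise SyntaxError("trailing input at token " + repr(parser.peek()))
    return tree


print(parse("x * 23 + y"))               # whole formula
print(parse("x * 23", start="product"))  # fragment: just a product
```

With a recursive-descent parser every nonterminal is already an entry point, so the fragment case costs almost nothing. With a table-driven LALR parser the tables are built for one start symbol, which is one reason fragment parsing is rarely offered.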

Your only hope, IMHO, is to use a parser generator that comes with a large number of well-tested language definitions.

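ANTLR is arguably one; it certainly has a lot of contributed language definitions, and they can all be found in one place. However, as far as I know, it won't parse language fragments, and I doubt it has an XML export for all parse trees.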

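Bison is arguably one; a huge number of language processors have been built with Bison. But those definitions are scattered all over the place, and collecting them would be very hard. It doesn't do language fragments either, and I'm pretty sure it has no XML export.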

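Our DMS Software Reengineering Toolkit is arguably one. It has many language definitions, and they are all collected in one place (our company). It produces an AST for every parse and has a built-in XML export. DMS can also parse any nonterminal of a language (that is, language fragments).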

DMS could emulate your example quite well, given DMS .lex and .atg ("attribute grammar") files and a compatible source file.

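Here is a DMS lexer/parser build and run with XML export, for the algebra grammar found at "Algebra as DMS Domain" (the ++XML in the middle of the example is where the parse step is told to export XML):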

C:\DMS\Domains\Algebra\Tools\Parser\Source>make
perl /cygdrive/c/DMS/Executables/MakeDMSTool Algebra -lexer
MakeDMSTool: Selected domain "Algebra".
LexerGenerator V2.1a
Copyright (c) 1999-2010 Semantic Designs, Inc.; All Rights Reserved
Parsing lexical specification ...
Processing mode Algebra ...
Exiting with final status 0
perl /cygdrive/c/DMS/Executables/MakeDMSTool Algebra -tool %Temporaries
MakeDMSTool: Selected domain "Algebra".
Using attribute grammar in "/cygdrive/c/DMS/Domains/Algebra/Tools/Parser/Source/Syntax/Algebra.atg"
AttributeEvaluatorGenerator V3.0
Copyright (c) 1999-2010 Semantic Designs, Inc.; All Rights Reserved
Parsing attribute grammar ...
Generating attribute evaluator(s) ...
Exiting with final status 0

rm -rf /cygdrive/c/DMS/Domains/Algebra/Tools/%Temporaries
perl /cygdrive/c/DMS/Executables/MakeDMSTool Algebra -prettyprinter
MakeDMSTool: Selected domain "Algebra".
PrettyPrinterGenerator V2.0
Copyright (c) 1999-2010 Semantic Designs, Inc.; All Rights Reserved

Parsing pretty printer specification ...
Generating pretty printer ...
Exiting with final status 0

AttributeEvaluatorGenerator V3.0
Copyright (c) 1999-2010 Semantic Designs, Inc.; All Rights Reserved
Parsing attribute grammar ...
Generating attribute evaluator(s) ...
......................

Exiting with final status 0
cd /cygdrive/c/DMS/Domains/Algebra/Tools/Parser/Source/\%Generated; \
    perl /cygdrive/c/DMS/Executables/MakeDMSTool Algebra -weave-preserve-productions %PreserveProductions.*.par
MakeDMSTool: Selected domain "Algebra".
perl /cygdrive/c/DMS/Executables/MakeDMSTool Algebra -parser
MakeDMSTool: Selected domain "Algebra".
export PARLANSEINCLUDEDIRECTORIES=`perl -e '($_ = $ARGV[0].";/cygdrive/c/DMS/Domains/PARLANSE/Library/Arrays;/cygdrive/c/DMS/Domains
/PARLANSE/Library/Bags;/cygdrive/c/DMS/Domains/PARLANSE/Library/HashTables;/cygdrive/c/DMS/Domains/PARLANSE/Library/Pipes;/cygdrive/
c/DMS/Domains/PARLANSE/Library/Sequences;/cygdrive/c/DMS/Domains/PARLANSE/Library/Sets;/cygdrive/c/DMS/Domains/PARLANSE/Library/Stac
ks;/cygdrive/c/DMS/Domains/PARLANSE/Library/Utilities;/cygdrive/c/DMS/Domains/PARLANSE/Library/Algorithms/Source;/cygdrive/c/DMS/Dom
ains/PARLANSE/Library/Booleans/Source;/cygdrive/c/DMS/Domains/PARLANSE/Library/Characters/Source;/cygdrive/c/DMS/Domains/PARLANSE/Li
brary/Graphics/Source;/cygdrive/c/DMS/Domains/PARLANSE/Library/HashTrees/Source;/cygdrive/c/DMS/Domains/PARLANSE/Library/Numbers/Sou
rce;/cygdrive/c/DMS/Domains/PARLANSE/Library/References/Source;/cygdrive/c/DMS/Domains/PARLANSE/Library/SQL/Source;/cygdrive/c/DMS/D
omains/PARLANSE/Library/Streams/Source;/cygdrive/c/DMS/Domains/PARLANSE/Library/SuffixTrees/Source;/cygdrive/c/DMS/Domains/PARLANSE/
Library/System/Source;/cygdrive/c/DMS/Domains/PARLANSE/Library/Search/Source;/cygdrive/c/DMS/Domains/PARLANSE/Library/TestSupport/So
urce") =~ s!//(.)/!$1:/!g; $_ =~ s!/cygdrive/(.)/!$1:/!g; print $_' "/cygdrive/c/DMS/Domains/Algebra/Tools/Parser/Source;/cygdrive/c
/DMS/Domains/Algebra/Tools/Parser/Source/Components;/cygdrive/c/DMS/Domains/Algebra/Tools/Parser/Source/%Generated;/cygdrive/c/DMS/D
omains/DMSStringGrammar/Tools/DomainParser/Source;/cygdrive/c/DMS/Domains/Algebra/Tools/Lexer/Source;/cygdrive/c/DMS/Domains/Algebra
/Tools/Lexer/Source/%Generated;/cygdrive/c/DMS/Domains/DMSLexical/Tools/DomainLexer/Source;/cygdrive/c/DMS/Infrastructure/HyperGraph
/Source;/cygdrive/c/DMS/Domains"`; \
    cd `echo /cygdrive/c/DMS/Domains/Algebra/Tools/Parser/Source`; \
    nice /cygdrive/c/DMS/Domains/PARLANSE/Tools/Compiler/p0c.exe  DomainParser.par
PARLANSE0 Compiler V19.16.40
Semantic Designs, Inc. *** Confidential Information
128/485/133408 smallest/average/largest activation record/grain stack space required.
Largest stack space required by function at Line    1533
 in file FFIModule.par
89 grains.
3775 functions/procedures.
223447 lines of source code read.
7160772 bytes of object code.
No errors detected.
mv -f /cygdrive/c/DMS/Domains/Algebra/Tools/Parser/Source/DomainParser.P0B /cygdrive/c/DMS/Domains/Algebra/Tools/Parser/DomainParser
.P0B

C:\DMS\Domains\Algebra\Tools\Parser\Source>run ../DomainParser ++XML C:\DMS\Domains\Algebra\Tools\Lexer\TestCase\algebraformula.txt
Domain Parser for Algebra 2.3.3
Copyright (C) Semantic Designs 1996-2010; All Rights Reserved
31 tree nodes in tree.
<DMSForest>
 <tree node="formula" type="1" domain="1" id="10qx0" parents="0" line="1" column="1" file="1">
  <tree node="product" type="4" domain="1" id="10qwx" line="1" column="1" file="1">
   <tree node="term" type="10" domain="1" id="10qwy" line="1" column="1" file="1">
    <tree node="'D'" type="19" domain="1" id="10qw5" literal="0" line="1" column="1" file="1"/>
    <tree node="'['" type="20" domain="1" id="10qw6" literal="0" line="1" column="2" file="1"/>
    <tree node="formula" type="1" domain="1" id="10qwt" line="1" column="4" file="1">
     <tree node="product" type="4" domain="1" id="10qws" line="1" column="4" file="1">
      <tree node="term" type="9" domain="1" id="10qwr" line="1" column="4" file="1">
       <tree node="'('" type="17" domain="1" id="10qw7" literal="0" line="1" column="4" file="1"/>
       <tree node="formula" type="3" domain="1" id="10qwp" line="1" column="5" file="1">
        <tree node="formula" type="2" domain="1" id="10qwk" line="1" column="5" file="1">
         <tree node="formula" type="1" domain="1" id="10qwf" line="1" column="5" file="1">
          <tree node="product" type="5" domain="1" id="10qwe" line="1" column="5" file="1">
           <tree node="product" type="4" domain="1" id="10qwa" line="1" column="5" file="1">
            <tree node="term" type="7" domain="1" id="10qw9" line="1" column="5" file="1">
             <tree node="VARIABLE" type="15" domain="1" id="10qw8" line="1" column="5" file="1">
              <literal>x</literal>
             </tree>
            </tree>
           </tree>
           <tree node="'*'" type="13" domain="1" id="10qwb" literal="0" line="1" column="7" file="1"/>
           <tree node="term" type="8" domain="1" id="10qwd" line="1" column="8" file="1">
            <tree node="NUMBER" type="16" domain="1" id="10qwc" literal="23" line="1" column="8" file="1"/>
           </tree>
          </tree>
         </tree>
         <tree node="'+'" type="11" domain="1" id="10qwg" literal="0" line="1" column="10" file="1"/>
         <tree node="product" type="4" domain="1" id="10qwj" line="1" column="12" file="1">
          <tree node="term" type="7" domain="1" id="10qwi" line="1" column="12" file="1">
           <tree node="VARIABLE" type="15" domain="1" id="10qwh" line="1" column="12" file="1">
            <literal>y</literal>
           </tree>
          </tree>
         </tree>
        </tree>
        <tree node="'-'" type="12" domain="1" id="10qwl" literal="0" line="1" column="13" file="1"/>
        <tree node="product" type="4" domain="1" id="10qwo" line="1" column="14" file="1">
         <tree node="term" type="7" domain="1" id="10qwn" line="1" column="14" file="1">
          <tree node="VARIABLE" type="15" domain="1" id="10qwm" line="1" column="14" file="1">
           <literal>z</literal>
          </tree>
         </tree>
        </tree>
       </tree>
       <tree node="')'" type="18" domain="1" id="10qwq" literal="0" line="1" column="15" file="1"/>
      </tree>
     </tree>
    </tree>
    <tree node="','" type="21" domain="1" id="10qwu" literal="0" line="1" column="16" file="1"/>
    <tree node="VARIABLE" type="15" domain="1" id="10qwv" line="1" column="18" file="1">
     <literal>x</literal>
    </tree>
    <tree node="']'" type="22" domain="1" id="10qww" literal="0" line="1" column="19" file="1"/>
   </tree>
  </tree>
 </tree>
 <FileIndex>
  <File index="1">C:/DMS/Domains/Algebra/Tools/Lexer/TestCase/algebraformula.txt</File>
 </FileIndex>
 <DomainIndex>
  <Domain index="1">Algebra</Domain>
 </DomainIndex>
</DMSForest>
Exiting with final status 0

C:\DMS\Domains\Algebra\Tools\Parser\Source>
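An export like this is easy to post-process with any XML library. The following is a minimal Python sketch, not part of DMS, that reads such a forest with the standard `xml.etree.ElementTree` module and recovers an outline of node names and the sequence of leaf tokens. The `SAMPLE` text is a small fragment copied from the export above and wrapped in a `DMSForest` root. The rules in `leaf_text` for where a leaf's text lives (a `<literal>` child, a quoted node name, or the `literal` attribute) are assumptions read off this one sample, not a documented schema.

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<DMSForest>
 <tree node="product" type="5" domain="1" id="10qwe" line="1" column="5" file="1">
  <tree node="product" type="4" domain="1" id="10qwa" line="1" column="5" file="1">
   <tree node="term" type="7" domain="1" id="10qw9" line="1" column="5" file="1">
    <tree node="VARIABLE" type="15" domain="1" id="10qw8" line="1" column="5" file="1">
     <literal>x</literal>
    </tree>
   </tree>
  </tree>
  <tree node="'*'" type="13" domain="1" id="10qwb" literal="0" line="1" column="7" file="1"/>
  <tree node="term" type="8" domain="1" id="10qwd" line="1" column="8" file="1">
   <tree node="NUMBER" type="16" domain="1" id="10qwc" literal="23" line="1" column="8" file="1"/>
  </tree>
 </tree>
</DMSForest>
"""


def leaf_text(node):
    """Best-effort source text of a leaf node (assumed conventions, see lead-in)."""
    literal = node.find("literal")
    if literal is not None:                  # e.g. VARIABLE with a <literal> child
        return literal.text or ""
    name = node.get("node", "")
    if len(name) >= 2 and name[0] == "'" and name[-1] == "'":
        return name[1:-1]                    # e.g. node="'*'" is the token *
    return node.get("literal", name)         # e.g. NUMBER with literal="23"


def leaf_tokens(root):
    """Leaf tokens in document order: every <tree> without a <tree> child."""
    return [leaf_text(n) for n in root.iter("tree") if n.find("tree") is None]


def outline(node, depth=0):
    """Indented list of node names, one entry per <tree> element."""
    lines = []
    for child in node.findall("tree"):
        lines.append("  " * depth + child.get("node", "?"))
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
print("\n".join(outline(root)))
print(" ".join(leaf_tokens(root)))
```

From here, converting to the compact XML in the question or to JSON is a matter of choosing which node names to keep and which single-child chains (formula, product, term) to collapse.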

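If you really wanted an engine that understands many grammar notations, building it with DMS would probably be the easiest route. Simply define each grammar formalism (e.g., ANTLR or bison) as a DSL to DMS, use DMS to parse an instance of that formalism (e.g., an ANTLR BNF file), apply DMS rewrite rules to convert it into a DMS grammar, and then build the DMS parser. (You would have to do the same for the lexer, too.)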