Regex for parsing assembly-like instructions

Date: 2013-09-29 10:33:25

Tags: c# regex parsing assembly abstract-syntax-tree

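The introduction is a bit lengthy, so please bear with me. :)

I am writing a simple regex-based parser for a large source file written in assembler. The instructions are mostly just moves, adds, subtracts and jumps, but it is a very large file that I need to port to two different languages, and I am too lazy to do it by hand. Those are the requirements and there is not much I can do about them (so please don't answer with something like "why don't you simply use ANTLR").

After some preprocessing (this part is already done: defines and macros are replaced, and redundant whitespace and comments are removed), I now basically have to read the file line by line and parse one or more lines into "intermediate" instructions, which I will then use to generate more or less 1-to-1 equivalents (using actual integer arithmetic and a bunch of GOTOs).

So, presuming I can have all these different addressing modes: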

[Image: addressing-mode table, "Addressing mode depends on the format of the instruction"]

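I can go two different ways:

  1. Have a single MOV regex that handles all these cases, or
  2. Have multiple MOV regexes, one per instruction format. The problem with this approach is that I would have to design each regex very carefully to avoid any ambiguity. There would also be a lot of repetition, since the source and destination operands share many addressing modes.

My question is: if I have a single regex for all the instructions, how should I specify my groups and captures so that I can easily tell the different modes apart?

Or should I just capture everything and then process the source and destination operands after the initial match?

E.g. a fairly simple match-all regex would be: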

    ^MOV\s+(?<dest>[^\s,]+)\s*,\s*(?<src>[^\s,]+)$
    

(split into multiple lines, with comments):

    ^MOV              (?#instruction)
    \s+               (?#some whitespace)
    (?<dest>[^\s,]+)  (?#match everything except whitespace and comma)
    \s*,\s*           (?#match comma, allow some whitespace)
    (?<src>[^\s,]+)   (?#match everything except whitespace and comma)$
    

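So, I could certainly do this and then process the `dest` and `src` groups separately. But would it be better to create one nasty, complex regex that matches all the cases in the addressing-mode table? In that case, I am not sure how to interpret the captures to find out which addressing mode was matched.

I am using C#, if that makes any difference.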
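To make the "capture everything, then process the operands" option concrete, here is a minimal sketch. It keeps the match-all MOV pattern and runs one shared operand pattern over `dest` and `src`, so the addressing modes are written only once. The register syntax (`R0`, `R1`, ...), decimal immediates, and all type and method names are assumptions made for illustration, not part of the real instruction set.

```csharp
using System;
using System.Text.RegularExpressions;

public enum AddressingMode
{
    Unknown, Register, Immediate, Indirect, IndirectPreDecrement, IndirectPostIncrement
}

public static class MovSketch
{
    // Match-all pattern: both operands are captured as opaque text.
    private static readonly Regex Mov = new Regex(
        @"^MOV\s+(?<dest>[^\s,]+)\s*,\s*(?<src>[^\s,]+)$");

    // Shared operand pattern (hypothetical syntax): one named group per mode.
    private static readonly Regex Operand = new Regex(
        @"^(?:(?<reg>R\d+)|#(?<imm>\d+)|\[(?:-(?<predec>R\d+)|(?<postinc>R\d+)\+|(?<ind>R\d+))\])$");

    // The group that participated in the match identifies the addressing mode.
    public static AddressingMode Classify(string operand)
    {
        Match m = Operand.Match(operand);
        if (!m.Success) return AddressingMode.Unknown;
        if (m.Groups["reg"].Success) return AddressingMode.Register;
        if (m.Groups["imm"].Success) return AddressingMode.Immediate;
        if (m.Groups["predec"].Success) return AddressingMode.IndirectPreDecrement;
        if (m.Groups["postinc"].Success) return AddressingMode.IndirectPostIncrement;
        return AddressingMode.Indirect;
    }

    public static bool TryParseMov(string line, out AddressingMode dest, out AddressingMode src)
    {
        dest = src = AddressingMode.Unknown;
        Match m = Mov.Match(line);
        if (!m.Success) return false;
        dest = Classify(m.Groups["dest"].Value);
        src = Classify(m.Groups["src"].Value);
        return true;
    }

    public static void Main()
    {
        foreach (string line in new[] { "MOV R1, #42", "MOV [R2+], [R3]", "MOV [-R0], R5" })
        {
            if (TryParseMov(line, out AddressingMode d, out AddressingMode s))
                Console.WriteLine(line + ": dest=" + d + ", src=" + s);
        }
    }
}
```

Whether a given dest/src combination is actually legal would still need a separate check, since this sketch classifies each operand independently.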

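2 Answers:

Answer 0 (score: 1)

Here is a regex that does what you want (you will have to edit in the actual data from the table; i.e. instead of all the real register labels ax, bx, ... I just use placeholders such as 'Rw'):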

 (?<Opt1>MOV\s*Rw,\sRw)
|(?<Opt2>MOV\s*Rw,\s\#data4)
|(?<Opt3>MOV\s*Rw,\s\#data16)
|(?<Opt4>MOV\s*Rw,\s\[Rw\])
|(?<Opt5>MOV\s*Rw,\s\[Rw\+\])
|(?<Opt6>MOV\s*\[Rw\],\sRw)
|(?<Opt7>MOV\s*\[-Rw\],\sRw)
|(?<Opt8>MOV\s*\[Rw\],\s\[Rw\])
|(?<Opt9>MOV\s*\[Rw\+\],\s\[Rw\])
|(?<OptA>MOV\s*\[Rw\],\s\[Rw\+\]) 

Using this data:

MOV Rw, Rw
MOV Rw, #data4
MOV Rw, #data16
MOV Rw, [Rw]
MOV Rw, [Rw+]
MOV [Rw], Rw
MOV [-Rw], Rw
MOV [Rw], [Rw]
MOV [Rw+], [Rw]
MOV [Rw], [Rw+]

RegexBuddy produces this:

Match 1:    MOV Rw, Rw       0      10
Group "Opt1":   MOV Rw, Rw       0      10
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 2:    MOV Rw, #data4      12      14
Group "Opt1" did not participate in the match
Group "Opt2":   MOV Rw, #data4      12      14
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 3:    MOV Rw, #data16     28      15
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3":   MOV Rw, #data16     28      15
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 4:    MOV Rw, [Rw]        45      12
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4":   MOV Rw, [Rw]        45      12
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 5:    MOV Rw, [Rw+]       59      13
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5":   MOV Rw, [Rw+]       59      13
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 6:    MOV [Rw], Rw        74      12
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6":   MOV [Rw], Rw        74      12
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 7:    MOV [-Rw], Rw       88      13
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7":   MOV [-Rw], Rw       88      13
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 8:    MOV [Rw], [Rw]     103      14
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8":   MOV [Rw], [Rw]     103      14
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 9:    MOV [Rw+], [Rw]    119      15
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9":   MOV [Rw+], [Rw]    119      15
Group "OptA" did not participate in the match
Match 10:   MOV [Rw], [Rw+]    136      15
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA":   MOV [Rw], [Rw+]    136      15
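
In C#, the same "which group participated" information is available through `Match.Groups[name].Success`. The following is a minimal sketch, assuming the placeholder operands (`Rw`, `#data4`, ...) from the data above; the class and method names are hypothetical. `RegexOptions.IgnorePatternWhitespace` is what allows the pattern to be spread over several lines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptGroups
{
    // The alternation from this answer, one named group per MOV form.
    private static readonly Regex MovForms = new Regex(@"
         (?<Opt1>MOV\s*Rw,\sRw)
        |(?<Opt2>MOV\s*Rw,\s\#data4)
        |(?<Opt3>MOV\s*Rw,\s\#data16)
        |(?<Opt4>MOV\s*Rw,\s\[Rw\])
        |(?<Opt5>MOV\s*Rw,\s\[Rw\+\])
        |(?<Opt6>MOV\s*\[Rw\],\sRw)
        |(?<Opt7>MOV\s*\[-Rw\],\sRw)
        |(?<Opt8>MOV\s*\[Rw\],\s\[Rw\])
        |(?<Opt9>MOV\s*\[Rw\+\],\s\[Rw\])
        |(?<OptA>MOV\s*\[Rw\],\s\[Rw\+\])",
        RegexOptions.IgnorePatternWhitespace);

    // Returns the name of the group that matched, or null if nothing matched.
    public static string MatchedForm(string line)
    {
        Match m = MovForms.Match(line);
        if (!m.Success) return null;
        foreach (string name in MovForms.GetGroupNames())
        {
            // Group "0" is the whole match, so skip it.
            if (name != "0" && m.Groups[name].Success) return name;
        }
        return null;
    }

    public static void Main()
    {
        foreach (string line in new[] { "MOV Rw, #data16", "MOV [Rw], [Rw+]" })
            Console.WriteLine(line + " -> " + MatchedForm(line));
    }
}
```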

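Answer 1 (score: 1)

You are discovering what happens when you try to make a lexer do a parser's job. I think most of your difficulty comes from trying to do too much with regexes.

Yes, I am going to suggest ANTLR or an equivalent parser generator.

If you go that route, you write lots of small regexes that recognize tokens ("MOV", "#", "[", ...) and then you write a grammar that defines how those tokens compose into instructions. If nothing else, this makes the parsing part much easier to write.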
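
To illustrate the token/grammar split without a parser generator, here is a small hand-written sketch: a regex-based tokenizer plus a recursive-descent parser for a two-operand MOV. This is not ANTLR output; the token shapes (`R<n>` registers, `#<digits>` immediates), the grammar in the comments, and all names are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public sealed class Operand
{
    public readonly string Kind;   // "reg", "imm", "ind", "predec" or "postinc"
    public readonly string Value;  // register name or immediate digits
    public Operand(string kind, string value) { Kind = kind; Value = value; }
}

public static class TinyAsmParser
{
    // One small alternative per token kind; \G anchors each match at the current position.
    private static readonly Regex Token = new Regex(
        @"\G\s*(?:(?<mnemonic>MOV)\b|(?<reg>R\d+)|#(?<imm>\d+)|(?<punct>[\[\],+\-]))");

    public static List<(string Kind, string Text)> Tokenize(string line)
    {
        var tokens = new List<(string Kind, string Text)>();
        string text = line.TrimEnd();
        int pos = 0;
        while (pos < text.Length)
        {
            Match m = Token.Match(text, pos);
            if (!m.Success) throw new FormatException("Unexpected input at column " + pos);
            foreach (string kind in new[] { "mnemonic", "reg", "imm", "punct" })
                if (m.Groups[kind].Success) tokens.Add((kind, m.Groups[kind].Value));
            pos = m.Index + m.Length;
        }
        return tokens;
    }

    // instruction := MOV operand ',' operand
    public static (Operand Dest, Operand Src) ParseMov(string line)
    {
        var t = Tokenize(line);
        int i = 0;
        Expect(t, ref i, "mnemonic", "MOV");
        Operand dest = ParseOperand(t, ref i);
        Expect(t, ref i, "punct", ",");
        Operand src = ParseOperand(t, ref i);
        if (i != t.Count) throw new FormatException("Trailing tokens");
        return (dest, src);
    }

    // operand := reg | imm | '[' reg ']' | '[' '-' reg ']' | '[' reg '+' ']'
    private static Operand ParseOperand(List<(string Kind, string Text)> t, ref int i)
    {
        if (i >= t.Count) throw new FormatException("Missing operand");
        if (t[i].Kind == "reg") return new Operand("reg", t[i++].Text);
        if (t[i].Kind == "imm") return new Operand("imm", t[i++].Text);
        Expect(t, ref i, "punct", "[");
        bool pre = Accept(t, ref i, "-");
        string reg = Expect(t, ref i, "reg");
        bool post = !pre && Accept(t, ref i, "+");
        Expect(t, ref i, "punct", "]");
        return new Operand(pre ? "predec" : post ? "postinc" : "ind", reg);
    }

    // Consumes the next token if it has the given kind (and text, when supplied).
    private static string Expect(List<(string Kind, string Text)> t, ref int i,
                                 string kind, string text = null)
    {
        if (i >= t.Count || t[i].Kind != kind || (text != null && t[i].Text != text))
            throw new FormatException("Expected " + (text ?? kind));
        return t[i++].Text;
    }

    // Consumes the next token only if it is the given punctuation.
    private static bool Accept(List<(string Kind, string Text)> t, ref int i, string text)
    {
        if (i < t.Count && t[i].Kind == "punct" && t[i].Text == text) { i++; return true; }
        return false;
    }

    public static void Main()
    {
        var mov = ParseMov("MOV [-R0], #42");
        Console.WriteLine(mov.Dest.Kind + " " + mov.Dest.Value + " <- " + mov.Src.Kind + " " + mov.Src.Value);
    }
}
```

The operand rule is written once and reused for both source and destination, and the result is a small structure with the instruction parts already separated rather than a set of captures to decode.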

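You can see what this looks like for assembler code. (It uses a system other than ANTLR, but the idea is the same.) It was very simple to write, with none of the pain of trying to write one regex to rule them all. [I did that example in an evening, and used it to parse a large pile of source.]

You are not clear about what you mean by "port". Presumably you are moving to another assembler syntax, if not to another machine architecture. To do that, you need access to the individual parts of each instruction, and a single regex for all possible MOV instructions will not give you that. This is the beauty of parsing and building trees: all of those parts are exposed to you, embedded in the structure to which they belong. You can even generate a single instruction from several assembly language statements, because the tree contains the entire program. (On a system with 1 GB of RAM, a "very large" source file does not mean the tree size will be a problem.)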