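How can I write a Parsing Expression Grammar in any of the following parser generators (PEG.js, Citrus, Treetop) that can handle Python/Haskell/CoffeeScript-style indentation?
Examples of a not-yet-existing programming language: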
square x =
    x * x
cube x =
    x * square x
fib n =
  if n <= 1
    0
  else
    fib(n - 2) + fib(n - 1) # some cheating allowed here with brackets
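Update: Don't try to write an interpreter for the examples above. I'm only interested in the indentation problem. Another example might be parsing the following: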
foo
  bar = 1
  baz = 2
tap
  zap = 3
# should yield (ruby style hashmap):
# {:foo => { :bar => 1, :baz => 2}, :tap => { :zap => 3 } }
Answer 0 (score: 23)
Pure PEG cannot parse indentation.
But peg.js can.
I did a quick-and-dirty experiment (inspired by Ira Baxter's comment about cheating).
/* Initializations */
{
  function start(first, tail) {
    var done = [first[1]];
    for (var i = 0; i < tail.length; i++) {
      done = done.concat(tail[i][1][0]);
      done.push(tail[i][1][1]);
    }
    return done;
  }

  var depths = [0];

  function indent(s) {
    var depth = s.length;

    if (depth == depths[0]) return [];

    if (depth > depths[0]) {
      depths.unshift(depth);
      return ["INDENT"];
    }

    var dents = [];
    while (depth < depths[0]) {
      depths.shift();
      dents.push("DEDENT");
    }

    if (depth != depths[0]) dents.push("BADDENT");

    return dents;
  }
}

/* The real grammar */
start   = first:line tail:(newline line)* newline? { return start(first, tail) }
line    = depth:indent s:text                      { return [depth, s] }
indent  = s:" "*                                   { return indent(s) }
text    = c:[^\n]*                                 { return c.join("") }
newline = "\n"                                     {}
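depths is a stack of indentation widths. indent() returns an array of indentation tokens, and start() unwraps that array so that the parser behaves somewhat like a stream.
For the following text: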
alpha
  beta
  gamma
    delta
epsilon
    zeta
  eta
theta
  iota
peg.js produces these results:
[
   "alpha",
   "INDENT",
   "beta",
   "gamma",
   "INDENT",
   "delta",
   "DEDENT",
   "DEDENT",
   "epsilon",
   "INDENT",
   "zeta",
   "DEDENT",
   "BADDENT",
   "eta",
   "theta",
   "INDENT",
   "iota",
   "DEDENT",
   "",
   ""
]
This parser even catches bad indentation.
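The flat token stream can be folded into a nested structure in a post-processing step. The following is a minimal sketch and not part of the original answer: nest is a hypothetical helper that assumes the stream uses exactly the "INDENT", "DEDENT" and "BADDENT" strings shown in the results, and it throws on "BADDENT" instead of guessing a level.

```javascript
// Hypothetical helper (not from the original answer): fold a flat stream of
// line texts and "INDENT"/"DEDENT" markers into nested arrays.
function nest(tokens) {
  var root = [];
  var stack = [root];            // stack[stack.length - 1] is the block being filled
  tokens.forEach(function (tok) {
    var current = stack[stack.length - 1];
    if (tok === "INDENT") {
      var child = [];
      current.push(child);       // an indented block becomes a nested array
      stack.push(child);
    } else if (tok === "DEDENT") {
      if (stack.length === 1) throw new Error("unbalanced DEDENT");
      stack.pop();
    } else if (tok === "BADDENT") {
      throw new Error("inconsistent indentation");
    } else {
      current.push(tok);         // ordinary line text
    }
  });
  return root;
}

var tree = nest(["alpha", "INDENT", "beta", "gamma", "INDENT", "delta",
                 "DEDENT", "DEDENT", "epsilon"]);
console.log(JSON.stringify(tree));
// prints ["alpha",["beta","gamma",["delta"]],"epsilon"]
```

Because markers and line text share one array, a line whose text is literally "INDENT" would be misread; a real implementation would probably emit tagged token objects instead.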
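Answer 1 (score: 9)
I think an indentation-sensitive language like that is context-sensitive. I believe PEGs can only do context-free languages.
Note that, while nalply's answer is certainly correct that PEG.js can do it via external state (i.e. the dreaded global variables), it can be a dangerous path to walk down (worse than the usual problems with global variables). Some rules can initially match (and then run their actions), but a parent rule can then fail, which makes that action run invalid. If external state is changed in such an action, you can end up with invalid state. This is super awful, and could lead to tremors, vomiting, and death. Some issues and solutions to this are in the comments here: https://github.com/dmajda/pegjs/issues/45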
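The hazard can be reproduced without any parser generator. The sketch below is a hand-written stand-in for an ordered choice, not PEG.js output; indentAction and parseLine are hypothetical names, and the two-alternative rule is invented for illustration. The first alternative runs a state-changing action and then fails, so the second alternative sees a stack that already contains the new depth and the INDENT token is lost unless the state is restored first.

```javascript
var depths = [0];

// Same idea as indent() in the answer above: returns a token and mutates `depths`.
function indentAction(depth) {
  if (depth > depths[0]) { depths.unshift(depth); return ["INDENT"]; }
  return [];
}

// Hand-written stand-in for the (invented) ordered choice
//   line = indent "=" text / indent text
function parseLine(input, restoreOnFail) {
  var saved = depths.slice();                 // snapshot of the external state
  var spaces = /^ */.exec(input)[0].length;

  var tokens = indentAction(spaces);          // alternative 1: action runs immediately
  if (input.charAt(spaces) === "=") return tokens;

  // Alternative 1 failed after its action already changed `depths`.
  if (restoreOnFail) depths = saved;          // undo the side effect before retrying
  return indentAction(spaces);                // alternative 2 re-runs the same action
}

var lost = parseLine("  foo", false);         // [] : the INDENT token was lost
depths = [0];
var kept = parseLine("  foo", true);          // ["INDENT"]
console.log(JSON.stringify(lost), JSON.stringify(kept));
```

Whether a generated parser actually re-runs an action after backtracking depends on the grammar and on result caching, so this only illustrates the failure mode, not the behavior of any particular PEG.js version.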
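Answer 2 (score: 7)
So what we are really doing here with indentation is creating something like C-style blocks, which often have their own lexical scope. If I were writing a compiler for a language like that, I think I would try to have the lexer keep track of the indentation. Every time the indentation increases, it could insert a '{' token. Likewise, every time it decreases, it could insert a '}' token. Then writing an expression grammar with explicit curly braces to represent lexical scope becomes more straightforward.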
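A minimal sketch of such a pre-lexer, under stated assumptions: braceTokens is a hypothetical name, indentation is assumed to use spaces only, blank lines are skipped, and each non-blank line is emitted as a single token rather than being lexed further.

```javascript
// Hypothetical pre-lexer: rewrites indentation changes as "{" and "}" tokens.
// Assumes spaces only (no tabs) and skips blank lines.
function braceTokens(source) {
  var levels = [0];                        // stack of open indentation widths
  var out = [];
  source.split("\n").forEach(function (line) {
    if (line.trim() === "") return;        // blank lines do not affect nesting
    var width = /^ */.exec(line)[0].length;
    if (width > levels[levels.length - 1]) {
      levels.push(width);
      out.push("{");
    }
    while (width < levels[levels.length - 1]) {
      levels.pop();
      out.push("}");
    }
    if (width !== levels[levels.length - 1]) {
      throw new Error("inconsistent indentation: " + JSON.stringify(line));
    }
    out.push(line.trim());
  });
  while (levels.length > 1) {              // close blocks still open at end of input
    levels.pop();
    out.push("}");
  }
  return out;
}

console.log(braceTokens("foo\n  bar = 1\n  baz = 2\ntap\n  zap = 3").join(" "));
// prints: foo { bar = 1 baz = 2 } tap { zap = 3 }
```

The grammar that consumes this stream no longer needs to know about whitespace at all, which is the point of doing the work in the lexer.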
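Answer 3 (score: 1)
You can do this in Treetop by using semantic predicates. In this case you need a semantic predicate that detects the closing of a whitespace-indented block due to the occurrence of another line with the same or lesser indentation. The predicate must count the indentation from the opening line, and return true (block closed) if the current line's indentation ends at the same or a shorter length. Because the closing condition is context-dependent, it must not be memoized. Here is the example code I am about to add to Treetop's documentation. Note that I have overridden Treetop's SyntaxNode inspect method to make it easier to visualise the result.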
grammar IndentedBlocks
  rule top
    # Initialise the indent stack with a sentinel:
    &{|s| @indents = [-1] }
    nested_blocks
    {
      def inspect
        nested_blocks.inspect
      end
    }
  end

  rule nested_blocks
    (
      # Do not try to extract this semantic predicate into a new rule.
      # It will be memo-ized incorrectly because @indents.last will change.
      !{|s|
        # Peek at the following indentation:
        save = index; i = _nt_indentation; index = save
        # We're closing if the indentation is less or the same as our enclosing block's:
        closing = i.text_value.length <= @indents.last
      }
      block
    )*
    {
      def inspect
        elements.map{|e| e.block.inspect}*"\n"
      end
    }
  end

  rule block
    indented_line       # The block's opening line
    &{|s|               # Push the indent level to the stack
      level = s[0].indentation.text_value.length
      @indents << level
      true
    }
    nested_blocks       # Parse any nested blocks
    &{|s|               # Pop the indent stack
      # Note that under no circumstances should "nested_blocks" fail, or the stack will be mis-aligned
      @indents.pop
      true
    }
    {
      def inspect
        indented_line.inspect +
          (nested_blocks.elements.size > 0 ? (
            "\n{\n" +
            nested_blocks.elements.map { |content|
              content.block.inspect+"\n"
            }*'' +
            "}"
          )
          : "")
      end
    }
  end

  rule indented_line
    indentation text:((!"\n" .)*) "\n"
    {
      def inspect
        text.text_value
      end
    }
  end

  rule indentation
    ' '*
  end
end
Here is a small test driver so you can try it easily:
require 'polyglot'
require 'treetop'
require 'indented_blocks'

parser = IndentedBlocksParser.new

input = <<END
def foo
  here is some indented text
    here it's further indented
    and here the same
      but here it's further again
      and some more like that
    before going back to here
      down again
  back twice
and start from the beginning again
  with only a small block this time
END

parse_tree = parser.parse input

p parse_tree
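For readers who do not use Treetop, the same closing rule can be sketched as a hand-written recursive descent. This is an analogue under stated assumptions, not Treetop code: parseBlocks and its helpers are hypothetical names, indentation is assumed to be spaces only, and blank lines are dropped up front. The indent stack starts with the same -1 sentinel, and a block list ends as soon as the next line is not indented deeper than the enclosing block.

```javascript
// Hypothetical recursive-descent analogue of the Treetop grammar above.
function parseBlocks(source) {
  var lines = source.split("\n").filter(function (l) { return l.trim() !== ""; });
  var pos = 0;
  var indents = [-1];                       // sentinel, as in rule `top`

  function indentOf(line) { return /^ */.exec(line)[0].length; }

  function nestedBlocks() {
    var blocks = [];
    // Peek at the next line; stop when it is not deeper than the enclosing block.
    while (pos < lines.length &&
           indentOf(lines[pos]) > indents[indents.length - 1]) {
      blocks.push(block());
    }
    return blocks;
  }

  function block() {
    var line = lines[pos++];                // the block's opening line
    indents.push(indentOf(line));           // push its indent level
    var children = nestedBlocks();          // parse any nested blocks
    indents.pop();                          // pop on the way out
    return { text: line.trim(), children: children };
  }

  return nestedBlocks();
}

console.log(JSON.stringify(parseBlocks("a\n  b\nc")));
// prints [{"text":"a","children":[{"text":"b","children":[]}]},{"text":"c","children":[]}]
```

Here the stack lives in ordinary local state and is pushed and popped around a call that cannot backtrack, which sidesteps the memoization concern that the grammar comments warn about.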
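Answer 4 (score: 0)
I know this is an old thread, but I just wanted to add some PEGjs code to the answers. This code will parse a piece of text and "nest" it into a sort of "AST-ish" structure. It only goes one level deep and it looks ugly. Furthermore, it does not really use the return values to create the right structure; instead it keeps an in-memory tree of your syntax and returns that at the end. This might well become unwieldy and cause some performance issues, but at least it does what it is supposed to do.
Note: Make sure you use tabs instead of spaces!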
{
    var indentStack = [],
        rootScope = {
            value: "PROGRAM",
            values: [],
            scopes: []
        };

    function addToRootScope(text) {
        // Here we wiggle with the form and append the new
        // scope to the rootScope.
        if (!text) return;

        if (indentStack.length === 0) {
            rootScope.scopes.unshift({
                text: text,
                statements: []
            });
        }
        else {
            rootScope.scopes[0].statements.push(text);
        }
    }
}

/* Add some grammar */

start
  = lines: (line EOL+)*
    {
        return rootScope;
    }

line
  = line: (samedent t:text { addToRootScope(t); }) &EOL
  / line: (indent t:text { addToRootScope(t); }) &EOL
  / line: (dedent t:text { addToRootScope(t); }) &EOL
  / line: [ \t]* &EOL
  / EOF

samedent
  = i:[\t]* &{ return i.length === indentStack.length; }
    {
        console.log("s:", i.length, " level:", indentStack.length);
    }

indent
  = i:[\t]+ &{ return i.length > indentStack.length; }
    {
        indentStack.push("");
        console.log("i:", i.length, " level:", indentStack.length);
    }

dedent
  = i:[\t]* &{ return i.length < indentStack.length; }
    {
        // Pop until the stack depth matches this line's tab count.
        // (The original popped i.length + 1 entries, which is only
        // right for the one-level case this grammar handles.)
        var popped = 0;
        while (indentStack.length > i.length) {
            indentStack.pop();
            popped++;
        }
        console.log("d:", popped, " level:", indentStack.length);
    }

text
  = numbers: number+ { return numbers.join(""); }
  / txt: character+ { return txt.join(""); }

number
  = $[0-9]

character
  = $[ a-zA-Z->+]

__
  = [ ]+

_
  = [ ]*

EOF
  = !.

EOL
  = "\r\n"
  / "\n"
  / "\r"