Question

现在已经有一段时间了，我试图让自己在Javascript中为org-mode编写一个解析器。我在解析大纲时没有遇到任何麻烦（我在几分钟内就完成了），但解析实际内容要困难得多，例如，我在使用叠加列表时遇到了麻烦。

* This is a heading
  P1 Start a paragraph here but since it is the first indentation level
the paragraph may have a lower indentation on the next line
    or a greater one for that matter.

  + LI1.1 I am beginning a list here
  + LI1.2 Here begins another list item
    which continues here
      and also here
  P2 but is broken here (this line becomes a paragraph
  outside of the first list).
  + LI2.1 P1 Second list item.
    - LI2.1.1 Inner list with a simple item
    - LI2.1.2 P1 and with an item containing several paragraphs.
      Here is the second line in the item, and now

      LI2.1.2 P2 I begin a new paragraph still in the same item. 
        The indentation can be only higher
    LI2.1 P2 but if the indentation is lower, it breaks the item, 
    (and the whole list), and this is a paragraph in the LI2.1
    list item

    - LI 2.2.1 You get the picture
  P3 Just plain text outside of the list.

（在上面的示例中，PX和LIX.Y仅用于明确显示新块的开头，它们不会出现在实际文档中。P代表对于列表项，段落和LI。在HTML世界中，PX将是<p>标记的开头。编号只是为了帮助跟踪列表的嵌套和更改。）

我想知道解析这种重要的白色空间叠加块的策略，显然我可以逐行解析而没有任何回溯或什么都没有，所以它必须非常简单，但由于某种原因我无法设法做到这一点。我试图从Markdown解析器获得灵感，或者那些本应具有类似重叠特征的东西，但它们在我看来（对于我看到的那些）非常hacky，充满了正则表达式，我希望我能写出更清洁的东西（org） - 当你想到它时，模式“语法”是非常巨大的，它会逐渐增长，我希望整个事情可以维护并允许轻松插入新功能。）

任何有解析此类事情经验的人都可以给我一些指示吗？

Answer 1

我喜欢解析器和编译器理论，因此我编写了一个小型解析器（手动），可以将您的示例代码段解析为 XML DOM Document对象。应该可以对其进行修改，以便生成其他类型的树结构，如自定义AST（抽象语法树）。

我试图让代码易于阅读，以便您可以看到这样的解析器是如何工作的。

问我是否需要更多解释，或者希望我稍微修改一下。

以示例代码段作为输入，语句result = new OrgModParser().parse(input); result.xml返回：

<org-mode-document indentLevel="-1">
    <section indentLevel="0">
        <header indentLevel="0">This is a heading</header>
            <paragraph indentLevel="1">P1 Start a paragraph here but since it is the first indentation level the paragraph may have a lower indentation on the next line or a greater one for that matter.</paragraph>
            <list indentLevel="1">
                <list-item indentLevel="1">
                    <paragraph indentLevel="2">LI1.1 I am beginning a list here</paragraph>
                </list-item>
                <list-item indentLevel="1">
                    <paragraph indentLevel="2">LI1.2 Here begins another list item which continues here and also here</paragraph>
                </list-item>
            </list>
        <paragraph indentLevel="1">P2 but is broken here (this line becomes a paragraph outside of the first list).</paragraph>
        <list indentLevel="1">
            <list-item indentLevel="1">
                <paragraph indentLevel="2">LI2.1 P1 Second list item.</paragraph>
                <list indentLevel="2">
                    <list-item indentLevel="2">
                        <paragraph indentLevel="3">LI2.1.1 Inner list with a simple item</paragraph>
                    </list-item>
                    <list-item indentLevel="2">
                        <paragraph indentLevel="3">LI2.1.2 P1 and with an item containing several paragraphs. Here is the second line in the item, and now</paragraph>
                        <paragraph indentLevel="3">LI2.1.2 P2 I begin a new paragraph still in the same item.  The indentation can be only higher</paragraph>
                    </list-item>
                </list>
                <paragraph indentLevel="2">LI2.1 P2 but if the indentation is lower, it breaks the item,  (and the whole list), and this is a paragraph in the LI2.1 list item</paragraph>
                <list indentLevel="2">
                    <list-item indentLevel="2">
                        <paragraph indentLevel="3">LI2.2.1 You get the picture</paragraph>
                    </list-item>
                </list>
            </list-item>
        </list>
        <paragraph indentLevel="1">P3 Just plain text outside of the list.</paragraph>
    </section>
</org-mode-document>

代码：

/*
 * File: orgmodparser.js
 * Basic usage: var object = new OrgModeParser().parse(input); 
 * Works on: JScript and JScript.Net.
 * - For other JavaScript platforms, just replace or override the .createRoot() method
 */

OrgModeParser = function (options) {
    if (typeof options == "object") {
        for (var i in options) {
            this[i] = options[i];
        }
    }
}

OrgModeParser.prototype = {

    "INDENT_WIDTH" :    2,  // Two spaces
    "LINE_SEPARATOR" :  "\r\n",

    /*
     * Each line in the input will be matched against this regexp.
     * Only spaces are allowed as indentation characters.
     * The symbols '*', '+' and '-' will be recognized, but only if they are followed by at least one space.
     * Add other symbols in this regexp if you want the parser to recognize them
     */
    "re" :    /^( *)([\+\-\*] +)?(.*)/,

    // This function must return a valid XML DOM document object
    createRoot :    function () {
        var err, progIDs = ["Msxml2.DOMDocument.6.0", "Msxml2.DOMDocument.5.0", "Msxml2.DOMDocument.4.0", "Msxml2.DOMDocument.3.0", "Msxml2.DOMDocument.2.0", "Msxml2.DOMDocument.1.0", "Msxml2.DOMDocument"];
        for (var i = 0; i < progIDs.length; i++) {
            try {
                return new ActiveXObject(progIDs[i]);
            }
            catch (err) {
            }
        }
        alert("Org-mode parser - Error - Failed to instantiate root object");
        return null;
    },

    parse : function (text) {

        function createNode (tagName, text) {
            var node = root.createElement(tagName);
            node.setAttribute("indentLevel", level);
            if (text) {
                var textNode = root.createTextNode(text);
                node.appendChild(textNode);
            }
            return node;
        }

        function getContainer () {
            if (lastNode.tagName == "section") { return lastNode; }
            var anc = lastNode.parentNode;
            while (anc) {
                if (modifier == "+" || modifier == "-") {
                    if (anc.getAttribute("indentLevel") == level && anc.tagName == "list") { return anc; }
                }
                if (anc.getAttribute("indentLevel") < level && anc.tagName != "paragraph") { return anc; }
                anc = anc.parentNode;
            }
            alert("Org-mode parser - Internal error at line: "+i);return null;
        }

        if (typeof text != "string") { alert("Org-mode - Type error - Input must be of type 'string'"); return null; }

        var body;
        var content;     // The text of the current line, without its indentation and modifier
        var lastNode;    // The node being processed
        var indent;      // The indentation of the current line
        var isAfterDubbleLineBreak;  // Indicates if the current line follows a dubble line break
        var line;        // The current line being processed
        var level;       // The current indentation level; given by indent.length / this.INDENT_WIDTH. Not to confuse with the nesting level 
        var lines;       // Array. Empty lines are included.
        var match;
        var modifier;    // This can be "*", "+", "-" or ""
        var root;

        isAfterDubbleLineBreak = false;
        level = -1;      // Indentation level is -1 initially; it will be 0 for the first "*"-bloc
        lines = text.split(this.LINE_SEPARATOR);
        root = this.createRoot();
        body = root.appendChild(createNode("org-mode-document"));
        lastNode = body;

        for (var i = 0; i < lines .length; i++) {
            line = lines[i];
            match = line.match(this.re);
            if (match === null) { alert("org-mode parse error at line: " + i); return null; }
            indent = match[1];
            level = indent.length / this.INDENT_WIDTH;
            modifier = match[2] && match[2].charAt(0);
            content = match[3];

            // These conditions tell the parser what to do when encountering a line with a given modifer
            if (content === "") { dubbleLineBreak(); continue; }
            else if (modifier == "+" || modifier == "-") { plus(); }
            else if (modifier == "*") { star(); }
            else if (modifier == "+") { plus(); }
            else if (modifier == "-") { minus(); }
            else if (modifier == "") { noModifier(); }
            isAfterDubbleLineBreak = false;
        }
        return root;


        function star() {
            // The '*' modifier is not allowed on an indented line
            if (indent) { alert("Org-mode parse error: unexpected '*' symbol at line " + i); return null; }
            lastNode = body.appendChild(createNode("section"));
            // The div remains the current node
            lastNode.appendChild(createNode("header", content));
        }

        function plus() {
            var container = getContainer();
            var tn = container.tagName;
            if (tn == "section" || tn == "list-item") {
                lastNode = container.appendChild(createNode("list"));
                lastNode = lastNode.appendChild(createNode("list-item"));
                lastNode = lastNode.appendChild(createNode("paragraph", content));
            } else if (tn == "list") {
                lastNode = container.appendChild(createNode("list-item"));
                lastNode = lastNode.appendChild(createNode("paragraph", content));
            }
            else alert("Org-mode parser - Internal error - Bad container tag name: " + tn);
            lastNode.setAttribute("indentLevel", Number(lastNode.getAttribute("indentLevel")) + 1);
        }

        function minus() { plus(); }

        function noModifier() {
            if (lastNode.tagName == "paragraph" && !isAfterDubbleLineBreak && (lastNode.getAttribute("indentLevel") == 1 || level >= lastNode.getAttribute("indentLevel"))) {
                lastNode.childNodes[0].appendData(" " + content);
            } else {
                var container = getContainer();
                lastNode = container.appendChild(createNode("paragraph", content));
            }
        }

        function dubbleLineBreak() {
            while (lines[i+1] && /^\s*$/.test(lines[i+1])) { i++; }
            isAfterDubbleLineBreak = true;
        }

    }
};

Answer 2

就像我在评论中所说的那样，解析这个问题会很麻烦，就像许多类似Wiki的语言一样。

如果您要编写语法并让解析器生成器为您创建解析器，而不是手写解析器，则有各种选项。列举几个：

ANTLR可以生成JavaScript源代码。有关JavaScript演示，请参阅：antlr3 - Generating a Parse Tree
Jison
peg.js

我知道ANTLR可以做到这一点，但它不会是微不足道的，最重要的是，你需要掌握这个工具（将需要一些时间！）。我没有花太多时间在其他两个工具上，但怀疑他们会用这种讨厌的语言来完成工作。

使用手写解析器可以快速启动，但调试，增强或重写它将很困难。编写语法并让解析器生成器为您创建解析器将导致更容易调试，增强和重写解析器（通过语法），但是您需要花费（相当）一些时间学习使用该工具。

当然，如果写得正确的话，手写的解析器（很可能）会比生成的解析器更快。但是它们之间的差异可能只有很大的来源才会引起注意。

抱歉，我没有一般策略如何使用手写解析器处理此问题。

祝你好运！

Answer 3

有一个Javascript org-mode解析器可用here。

在Javascript中解析组织模式文件

3 个答案: