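Data structure for a natural-language parse tree in C/C++

Date: 2014-01-22 13:52:24

Tags: c++ c boost data-structures tree

I want to store a sentence in a data structure in C/C++. The example sentence "This uploads files to a remote machine." is represented as: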

(TOP
  (S
    (NP (DT This))
    (VP
      (VBZ uploads)
      (NP (NNS files))
      (PP (TO to) (NP (DT a) (JJ remote) (NN machine))))
    (. .)))

Is there a simple way to do this in C/C++? I build the tree manually (without using a parser).

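3 answers:

Answer 0 (score: 2)

The parser mentioned at http://opennlp.apache.org/ is very complex. It breaks a sentence down into nouns, verbs, prepositions, and so on. Rewriting it in C/C++ would be a daunting task.

It is better to use the parser and read its output into a C/C++ data structure.

Assuming you have the parser's output, the format of that output is fairly simple. The structure would be something like this: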

struct SentencePart {
  SType type;
  // If the type is a basic word type (e.g. NN, JJ, etc)
  char* word;      
  // If the type is a complex sub-sentence.
  struct SentencePart* sentence_part;
};

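You can create an enum of the types (TOP, S, VP, NP, etc.). Then you read the input and create the structures according to the type you scan.

This is a very simple approach; there are probably others.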
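As a rough illustration of the enum idea, here is a minimal C++17 sketch. The enumerator list, the `stype_from_label` helper and the `parts` member are assumptions made for this example and do not come from the answer. The list of parts is a variation on the single pointer above, chosen because a phrase such as NP can contain several parts.

```cpp
#include <string>
#include <vector>

// Hypothetical tag set: only the labels that occur in the example sentence.
enum class SType { TOP, S, NP, VP, PP, DT, JJ, NN, NNS, VBZ, TO, PERIOD, UNKNOWN };

// Map a scanned label such as "NP" to its enum value (illustrative helper).
SType stype_from_label(const std::string& label) {
    static const struct { const char* text; SType type; } table[] = {
        {"TOP", SType::TOP}, {"S", SType::S},     {"NP", SType::NP},
        {"VP", SType::VP},   {"PP", SType::PP},   {"DT", SType::DT},
        {"JJ", SType::JJ},   {"NN", SType::NN},   {"NNS", SType::NNS},
        {"VBZ", SType::VBZ}, {"TO", SType::TO},   {".", SType::PERIOD},
    };
    for (const auto& entry : table)
        if (label == entry.text) return entry.type;
    return SType::UNKNOWN;
}

// Variation on SentencePart: a list of parts instead of a single pointer.
// A std::vector of the enclosing type as a member requires C++17.
struct SentencePart {
    SType type;
    std::string word;                 // filled in for word-level tags (NN, JJ, ...)
    std::vector<SentencePart> parts;  // filled in for phrases (S, NP, VP, ...)
};
```

A reader would call `stype_from_label` on each label it scans and fill in either `word` or `parts` depending on what follows the label.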

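Answer 1 (score: 0)

Expanding on Trenin's answer, I would use a directory-like tree, where siblings are the coordinate parts and children are the subordinate parts: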

typedef struct Token Token;

struct Token {
    const char *type;   /* Type of token, could be an enum */
    const char *data;   /* associated word */
    Token *next;        /* next coordinate token */
    Token *child;       /* eldest subordinate token */
};

You can then devise a level-based method for inserting tokens into that tree:

root = token_new_level(0, "TOP", NULL);

token_new_level(1,          "S", NULL);
token_new_level(  2,        "NP", NULL);
token_new_level(    3,      "DT", "this");
token_new_level(  2,        "VP", NULL);
token_new_level(    3,      "VBZ", "uploads");
token_new_level(    3,      "NP", NULL);
token_new_level(      4,    "NNS", "files");
token_new_level(    3,      "PP", NULL);
token_new_level(      4,    "TO", "to");
token_new_level(      4,    "NP", NULL);
token_new_level(        5,  "DT", "a");
token_new_level(        5,  "JJ", "remote");
token_new_level(        5,  "NN", "machine");
token_new_level(  2,        ".", ".");

which yields:

TOP
    S
        NP
            DT this
        VP
            VBZ uploads
            NP
                NNS files
            PP
                TO to
                NP
                    DT a
                    JJ remote
                    NN machine
        . .

as a tree, or in flat form:

 (TOP (S (NP (DT this)) (VP (VBZ uploads) (NP (NNS files)) 
      (PP (TO to) (NP (DT a) (JJ remote) (NN machine)))) (. .)))

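The noun phrase NP and the verb phrase VP are coordinate and are linked via next. Both NP and VP are subordinate to the sentence S, but only NP is stored directly as the child of S.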
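The answer does not show token_new_level itself. The following is one possible sketch, assuming the function remembers the most recent token at each level in a static table. `MAX_DEPTH`, the `last` table and the `token_to_string` printer are hypothetical names introduced for this example.

```cpp
#include <string>

struct Token {
    const char *type;   // type of token
    const char *data;   // associated word, or nullptr for a phrase
    Token *next;        // next coordinate token
    Token *child;       // eldest subordinate token
};

// Hypothetical implementation: append a token at the given depth.
// Returns nullptr if the level is out of range or has no parent yet.
// Tokens are never freed in this sketch.
Token *token_new_level(int level, const char *type, const char *data) {
    static const int MAX_DEPTH = 64;        // assumed depth limit
    static Token *last[MAX_DEPTH] = {};     // most recent token per level

    if (level < 0 || level >= MAX_DEPTH) return nullptr;
    Token *prev = last[level];
    Token *parent = level > 0 ? last[level - 1] : nullptr;
    if (!prev && level > 0 && !parent) return nullptr;

    Token *tok = new Token{type, data, nullptr, nullptr};
    if (prev) prev->next = tok;             // coordinate: link as a sibling
    else if (parent) parent->child = tok;   // first subordinate of its parent
    last[level] = tok;

    // Deeper levels belonged to the previous sibling, so forget them.
    for (int i = level + 1; i < MAX_DEPTH; ++i) last[i] = nullptr;
    return tok;
}

// Render a token and its following siblings in the flat bracketed form.
std::string token_to_string(const Token *tok) {
    std::string out;
    for (; tok; tok = tok->next) {
        if (!out.empty()) out += ' ';
        out += '(';
        out += tok->type;
        if (tok->data) { out += ' '; out += tok->data; }
        if (tok->child) { out += ' '; out += token_to_string(tok->child); }
        out += ')';
    }
    return out;
}
```

Because the table is static, this sketch can only build one tree at a time; passing an explicit builder object instead would remove that restriction.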

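Only tokens without children have a word attached, so you could refine the model by using a union in C, or in C++ two different classes, say Phrase and Word, that both inherit from Token.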
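A minimal sketch of the C++ variant might look like the following. The member names, the `add` and `str` methods, and the use of a vector of owned children instead of next/child links are all assumptions made for this illustration (C++14 or later).

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Common base: every node carries a type label such as "NP" or "DT".
struct Token {
    explicit Token(std::string t) : type(std::move(t)) {}
    virtual ~Token() = default;
    virtual std::string str() const = 0;   // flat bracketed form
    std::string type;
};

// Leaf node: a tag with its word attached.
struct Word : Token {
    Word(std::string t, std::string w)
        : Token(std::move(t)), word(std::move(w)) {}
    std::string str() const override { return "(" + type + " " + word + ")"; }
    std::string word;
};

// Inner node: a tag with subordinate tokens and no word.
struct Phrase : Token {
    explicit Phrase(std::string t) : Token(std::move(t)) {}
    Phrase& add(std::unique_ptr<Token> child) {
        children.push_back(std::move(child));
        return *this;                      // allow chained add() calls
    }
    std::string str() const override {
        std::string out = "(" + type;
        for (const auto& c : children) out += " " + c->str();
        return out + ")";
    }
    std::vector<std::unique_ptr<Token>> children;
};
```

With this split a Word can never acquire children and a Phrase can never carry a word, which is the refinement the paragraph above describes.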

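Answer 2 (score: 0)

You are basically using S-expressions. Edit: apparently I missed part of the question. However, the technique below extends easily to other kinds of trees.

I like to use recursive Boost variants for these: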

using s_expr = boost::make_recursive_variant<std::string, std::vector<boost::recursive_variant_> >::type;
using s_list = std::vector<s_expr>;

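Of course, part of the reason may be that I use Boost Spirit to parse these ASTs easily. So here is my demo program showing how it is used.

See it Live on Coliru

The test program shows how to parse the sample you showed, as well as how to construct the equivalent AST in code. Note that the assertion proves that both yield exactly the same expression tree: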

int main()
{
    s_expr parsed = parse_s_expr(
            "(TOP\n"
            "  (S\n"
            "    (NP (DT This))\n"
            "    (VP\n"
            "      (VBZ uploads)\n"
            "      (NP (NNS files))\n"
            "      (PP (TO to) (NP (DT a) (JJ remote) (NN machine))))\n"
            "    (. .)"
            ")"
            ")");

    std::cout << "parsed: " << parsed           << "\n";

    // conversely, just build one:
    const s_expr in_code(s_list {
        "TOP",
        s_list { "S",
            s_list { "NP", s_list { "DT", "This" } },
            s_list { "VP",
                s_list { "VBZ", "uploads" },
                s_list { "NP", s_list { "NNS", "files" } },
                s_list { "PP",
                    s_list { "TO", "to" },
                    s_list { "NP", s_list { "DT", "a" }, s_list { "JJ", "remote" }, s_list { "NN", "machine" } } } },
            s_list { ".", "." }
        }
    });

    // both AST trees are exactly equivalent:
    assert(in_code == parsed);
}

The output (as shown at the Coliru link) is:

  

parsed: ( TOP ( S ( NP ( DT This ) ) ( VP ( VBZ uploads ) ( NP ( NNS files ) ) ( PP ( TO to ) ( NP ( DT a ) ( JJ remote ) ( NN machine ) ) ) ) ( . . ) ) )

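Here is the full program. (Note that implementing the parser takes all of 35 lines :) and it is very flexible and efficient, thanks to Spirit.)

Full demo program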

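#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/variant.hpp>
#include <algorithm>   // std::copy
#include <cassert>     // assert
#include <iostream>    // std::cout
#include <iterator>    // std::ostream_iterator
#include <stdexcept>
#include <string>
#include <vector>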

namespace qi    = boost::spirit::qi;
namespace phx   = boost::phoenix;

using s_expr = boost::make_recursive_variant<std::string, std::vector<boost::recursive_variant_> >::type;
using s_list = std::vector<s_expr>;

template <typename It, typename Skipper = qi::space_type>
    struct parser : qi::grammar<It, s_expr(), Skipper>
{
    parser() : parser::base_type(expr)
    {
        using namespace qi;

        value = lexeme [ +(graph - '(' - ')') ];
        list  = '(' >> *expr >> ')';
        expr  = list | value;

        BOOST_SPIRIT_DEBUG_NODES((expr)(value)(list));
    }

  private:
    qi::rule<It, s_expr(),      Skipper> expr;
    qi::rule<It, std::string(), Skipper> value;
    qi::rule<It, s_list(),      Skipper> list;
};

s_expr parse_s_expr(const std::string& input)
{
    typedef std::string::const_iterator It;

    static const parser<It, qi::space_type> p;

    It f(begin(input)), l(end(input));
    s_expr data;

    if (!qi::phrase_parse(f,l,p,qi::space,data))
        throw std::runtime_error("parse failed: '" + std::string(f,l) + "'");

    return data;
}

namespace std { // a hack for easy debug printing
    static inline std::ostream& operator<<(std::ostream& os, s_list const& l) {
        os << "( "; std::copy(l.begin(), l.end(), std::ostream_iterator<s_expr>(os, " "));
        return os << ")";
    }
}

int main()
{
    s_expr parsed = parse_s_expr(
            "(TOP\n"
            "  (S\n"
            "    (NP (DT This))\n"
            "    (VP\n"
            "      (VBZ uploads)\n"
            "      (NP (NNS files))\n"
            "      (PP (TO to) (NP (DT a) (JJ remote) (NN machine))))\n"
            "    (. .)"
            ")"
            ")");

    std::cout << "parsed: " << parsed           << "\n";

    // conversely, just build one:
    const s_expr in_code(s_list {
        "TOP",
        s_list { "S",
            s_list { "NP", s_list { "DT", "This" } },
            s_list { "VP",
                s_list { "VBZ", "uploads" },
                s_list { "NP", s_list { "NNS", "files" } },
                s_list { "PP",
                    s_list { "TO", "to" },
                    s_list { "NP", s_list { "DT", "a" }, s_list { "JJ", "remote" }, s_list { "NN", "machine" } } } },
            s_list { ".", "." }
        }
    });

    // both AST trees are exactly equivalent:
    assert(in_code == parsed);
}
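
If Boost is not available, the same three-rule grammar (a value is a run of characters other than whitespace and parentheses, a list is a parenthesized sequence of expressions, an expression is either one) can be approximated by a small hand-written recursive-descent parser. This is a swapped-in technique, not the Spirit approach used above. `Node`, `parse_sexpr` and `sexpr_to_string` are hypothetical names, and unlike the demo it rejects trailing input. It requires C++17 for the vector-of-self member.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical node type: either an atom or a list of nodes.
struct Node {
    std::string atom;          // the text when the node is a leaf value
    std::vector<Node> items;   // the elements when the node is a list
    bool is_list = false;
};

static void skip_space(const std::string& s, std::size_t& pos) {
    while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos])))
        ++pos;
}

// expr = list | value
Node parse_node(const std::string& s, std::size_t& pos) {
    skip_space(s, pos);
    if (pos >= s.size()) throw std::runtime_error("unexpected end of input");
    Node node;
    if (s[pos] == '(') {                       // list = '(' expr* ')'
        node.is_list = true;
        ++pos;
        for (;;) {
            skip_space(s, pos);
            if (pos >= s.size()) throw std::runtime_error("missing ')'");
            if (s[pos] == ')') { ++pos; break; }
            node.items.push_back(parse_node(s, pos));
        }
        return node;
    }
    if (s[pos] == ')') throw std::runtime_error("unexpected ')'");
    // value = one or more characters that are not space or parentheses
    while (pos < s.size() && s[pos] != '(' && s[pos] != ')' &&
           !std::isspace(static_cast<unsigned char>(s[pos])))
        node.atom += s[pos++];
    return node;
}

Node parse_sexpr(const std::string& s) {
    std::size_t pos = 0;
    Node root = parse_node(s, pos);
    skip_space(s, pos);
    if (pos != s.size()) throw std::runtime_error("trailing input");
    return root;
}

// Print in the same spaced style as the demo output: "( TOP ( S ... ) )".
std::string sexpr_to_string(const Node& n) {
    if (!n.is_list) return n.atom;
    std::string out = "(";
    for (const Node& item : n.items) out += " " + sexpr_to_string(item);
    return out + " )";
}
```

The trade-off is the usual one: the hand-written version has no dependencies, while the Spirit grammar stays closer to the notation of the rules and produces the variant-based AST directly.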