Unsure how to create an ArangoDB graph using columns from an existing collection

Time: 2017-11-22 23:26:16

Tags: arangodb

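Background

I have a RocksDB collection containing three fields: _id, author, subreddit.

Problem

I want to create an ArangoDB graph that connects these two existing columns. However, the examples and the drivers seem to accept only collections as edge definitions.

Issue

The ArangoDB documentation lacks information on how to create a graph whose edges and nodes are drawn from the same collection.

Edit:

Solution

This was fixed by the code change in this ArangoDB issues ticket.

4 answers: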

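Answer 0 (score: 3)

Here is one way to do this using jq, a JSON-oriented command-line tool.

First, an outline of the steps:

1) Use arangoexport to export the author/subredit collection to a file, e.g. exported.json;

2) Run the jq script nodes_and_edges.jq, shown below;

3) Use arangoimp to import the JSON produced in (2) into ArangoDB.

Graphs can be stored in ArangoDB in several ways, so ultimately you may want to adapt nodes_and_edges.jq accordingly (e.g. to generate the nodes first, then the edges).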

INDEX

If your jq does not define INDEX, use:

def INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row|idx_expr|
      if type != "string" then tojson
      else .
      end] |= $row);
def INDEX(idx_expr): INDEX(.[]; idx_expr);

nodes_and_edges.jq

# This module is for generating JSON suitable for importing into ArangoDB.

### Generic Functions

# assign_keys/2
# Adds a "_key" of the form "<prefix><n>" to each object in the input array,
# numbering from start. It is defined before nodes/2 because jq requires a
# function to be defined before it is used.
def assign_keys(prefix; start):
  . as $in
  | reduce range(0;length) as $i ([];
    . + [$in[$i] + {"_key": "\(prefix)\(start+$i)"}]);

# nodes/2
# $name must be the name of the ArangoDB collection of nodes corresponding to $key.
# The scheme for generating key names can be altered by changing the first
# argument of assign_keys, e.g. to "" if no prefix is wanted.
def nodes($key; $name):
  map( {($key): .[$key]} ) | assign_keys($name[0:1] + "_"; 1);

# nodes_and_edges facilitates the normalization of an implicit graph
# in an ArangoDB "document" collection of objects having $from and $to keys.
# The input should be an array of JSON objects, as produced 
# by arangoexport for a single collection.
# If $nodesq is truthy, then the JSON for both the nodes and edges is emitted,
# otherwise only the JSON for the edges is emitted.
# 
# The first four arguments should be strings.
# 
# $from and $to should be the key names in . to be used for the from-to edges;
# $name1 and $name2 should be the names of the corresponding collections of nodes.
def nodes_and_edges($from; $to; $name1; $name2; $nodesq ):
  def dict($s): INDEX(.[$s]) | map_values(._key);
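  # objects/1 takes the field name, so that each dictionary is emitted
  # under its own field rather than always under $from.
  def objects($field): to_entries[] | {($field): .key, "_key": .value};
  (nodes($from; $name1) | dict($from)) as $fdict
  | (nodes($to; $name2) | dict($to)  ) as $tdict
  | (if $nodesq then ($fdict | objects($from)), ($tdict | objects($to))
     else empty end),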
    (.[] | {_from: "\($name1)/\($fdict[.[$from]])",
            _to:   "\($name2)/\($tdict[.[$to]])"} )  ;


### Problem-Specific Functions

# If you wish to generate the collections separately,
# then these will come in handy:
def authors: nodes("author"; "authors");
def subredits: nodes("subredit"; "subredits");

def nodes_and_edges:
  nodes_and_edges("author"; "subredit"; "authors"; "subredits"; true);

nodes_and_edges

Invocation

jq -cf nodes_and_edges.jq exported.json

This invocation produces a single stream of JSONL (JSON Lines): one set of lines for "authors", one for "subredits", and one for the edge collection.
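
Since arangoimp takes a single --collection per run, the mixed stream has to be separated before it is imported. The following is only a rough Python sketch of one way to do that, not part of the jq approach itself; the function name split_stream and the sample records are made up for illustration, and it assumes node records carry either an "author" or a "subredit" field while edge records carry "_from" and "_to".

```python
import json

def split_stream(lines):
    """Sort mixed JSON Lines into author nodes, subredit nodes, and edges."""
    authors, subredits, edges = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        doc = json.loads(line)
        if "_from" in doc and "_to" in doc:
            edges.append(doc)
        elif "author" in doc:
            authors.append(doc)
        elif "subredit" in doc:
            subredits.append(doc)
        else:
            raise ValueError(f"unrecognized record: {line}")
    return authors, subredits, edges

# Illustrative input only; real input would be the lines printed by jq.
sample = [
    '{"author":"A","_key":"a_1"}',
    '{"subredit":"S1","_key":"s_1"}',
    '{"_from":"authors/a_1","_to":"subredits/s_1"}',
]
authors, subredits, edges = split_stream(sample)
```

Each of the three lists could then be written to its own JSONL file and imported into the matching collection.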

Example

exported.json
[
  {"_id":"test/115159","_key":"115159","_rev":"_V8JSdTS---","author": "A", "subredit": "S1"},
  {"_id":"test/145120","_key":"145120","_rev":"_V8ONdZa---","author": "B", "subredit": "S2"},
  {"_id":"test/114474","_key":"114474","_rev":"_V8JZJJS---","author": "C", "subredit": "S3"}
]

Output

{"author":"A","_key":"a_1"}
{"author":"B","_key":"a_2"}
{"author":"C","_key":"a_3"}
{"subredit":"S1","_key":"s_1"}
{"subredit":"S2","_key":"s_2"}
{"subredit":"S3","_key":"s_3"}
{"_from":"authors/a_1","_to":"subredits/s_1"}
{"_from":"authors/a_2","_to":"subredits/s_2"}
{"_from":"authors/a_3","_to":"subredits/s_3"}
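
For readers who do not use jq, the same normalization can be sketched in Python. This is an approximation under stated assumptions rather than a port of the script: the function names are chosen for illustration, keys are built from the first letter of the collection name plus a counter, and, unlike the jq version, only distinct values are numbered, so a repeated author gets exactly one node.

```python
def assign_keys(values, prefix, start=1):
    """Map each distinct value to a key such as 'a_1', in first-seen order."""
    keys = {}
    for value in values:
        if value not in keys:
            keys[value] = f"{prefix}{start + len(keys)}"
    return keys

def nodes_and_edges(rows, from_field, to_field, from_coll, to_coll):
    """Turn rows holding two fields into two node lists and one edge list."""
    from_keys = assign_keys((r[from_field] for r in rows), from_coll[0] + "_")
    to_keys = assign_keys((r[to_field] for r in rows), to_coll[0] + "_")
    from_nodes = [{from_field: v, "_key": k} for v, k in from_keys.items()]
    to_nodes = [{to_field: v, "_key": k} for v, k in to_keys.items()]
    # One edge per input row, pointing at the keys assigned above.
    edges = [{"_from": f"{from_coll}/{from_keys[r[from_field]]}",
              "_to": f"{to_coll}/{to_keys[r[to_field]]}"} for r in rows]
    return from_nodes, to_nodes, edges

# Illustrative rows; author "A" appears twice on purpose.
rows = [
    {"author": "A", "subredit": "S1"},
    {"author": "B", "subredit": "S2"},
    {"author": "A", "subredit": "S2"},
]
authors, subredits, edges = nodes_and_edges(
    rows, "author", "subredit", "authors", "subredits")
```

With these rows there are two author nodes, two subredit nodes, and three edges, one per row.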

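Answer 1 (score: 2)

For a graph you need an edge collection for the edges and vertex collections for the nodes. You cannot create a graph from a single collection.

This topic in the documentation may help you.

Answer 2 (score: 2)

Note that the following queries take a while to complete on this huge dataset, but they should finish successfully after a few hours.

We start arangoimp to import our base dataset: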

arangoimp --create-collection true  --collection RawSubReddits --type jsonl ./RC_2017-01 

We use arangosh to create the collections in which our final data will live:

db._create("authors")
db._createEdgeCollection("authorsToSubreddits")

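We fill the authors collection, simply ignoring any duplicate authors that come up later. We compute the author's _key with the MD5 function so that it obeys the restrictions on the characters allowed in _key, and so that we can find it again later by calling MD5() on the author field: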

db._query(`
  FOR item IN RawSubReddits
    INSERT {
      _key: MD5(item.author),
      author: item.author
      } INTO authors
        OPTIONS { ignoreErrors: true }`);
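
The idea behind this query can be sketched outside the database. The Python below is only an illustration with made-up function names and sample rows: it derives a key from the author name with MD5 and keeps the first document seen for each key, which is the effect that ignoring the duplicate-key errors has. It assumes the name is hashed as UTF-8 text.

```python
import hashlib

def author_key(author):
    """Hex MD5 digest of the author name; only the characters 0-9 and a-f."""
    return hashlib.md5(author.encode("utf-8")).hexdigest()

def build_authors(rows):
    """Keep one document per author; later duplicates are skipped."""
    authors = {}
    for row in rows:
        key = author_key(row["author"])
        authors.setdefault(key, {"_key": key, "author": row["author"]})
    return list(authors.values())

# Illustrative rows; "abc" appears twice.
rows = [{"author": "abc"}, {"author": "xyz"}, {"author": "abc"}]
authors = build_authors(rows)
```

Because the key is a pure function of the name, any later query can recompute it from the author field instead of looking the document up.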

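After we have filled the second vertex collection (we keep the imported collection as the first vertex collection), we have to compute the edges. Since each author may have created several subreddit entries, there will most likely be several edges originating from each author. As mentioned before, we can use the MD5() function again to reference the authors created earlier: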

db._query(`
  FOR onesubred IN RawSubReddits
    INSERT {
      _from: CONCAT('authors/', MD5(onesubred.author)),
      _to: CONCAT('RawSubReddits/', onesubred._key)
    } INTO authorsToSubreddits`);
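
To make the shape of the resulting edges concrete, here is a small Python sketch of the same computation. The function names and sample rows are invented for illustration, and the name is assumed to be hashed as UTF-8 text: each row yields one edge whose _from is rebuilt from the author name and whose _to points back at the row itself.

```python
import hashlib

def md5_hex(text):
    """Hex MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def build_edges(rows):
    """One edge per row: author vertex -> the imported row."""
    return [{"_from": "authors/" + md5_hex(row["author"]),
             "_to": "RawSubReddits/" + row["_key"]} for row in rows]

# Illustrative rows; the first two share an author.
rows = [
    {"_key": "1", "author": "abc"},
    {"_key": "2", "author": "abc"},
    {"_key": "3", "author": "xyz"},
]
edges = build_edges(rows)
```

The first two edges share the same _from, which is how several edges end up originating from one author.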

After the edge collection is filled (which may take a while; we are talking about 400 million edges, right?), we create the graph description:

db._graphs.save({
  "_key": "reddits",
  "orphanCollections" : [ ],
  "edgeDefinitions" : [ 
    {
      "collection": "authorsToSubreddits",
      "from": ["authors"],
      "to": ["RawSubReddits"]
    }
  ]
})
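
An edge definition states which collections the _from and _to ends of an edge may come from. As a sanity check before relying on the graph, one could verify a sample of edges against it. The Python below is a hypothetical helper written for illustration, not an ArangoDB API; it only inspects the collection prefix of each document id.

```python
def collection_of(doc_id):
    """Return the collection part of an id such as 'authors/abc'."""
    return doc_id.split("/", 1)[0]

def violations(edge_definition, edges):
    """List the edges whose endpoints fall outside the edge definition."""
    allowed_from = set(edge_definition["from"])
    allowed_to = set(edge_definition["to"])
    return [e for e in edges
            if collection_of(e["_from"]) not in allowed_from
            or collection_of(e["_to"]) not in allowed_to]

edge_definition = {
    "collection": "authorsToSubreddits",
    "from": ["authors"],
    "to": ["RawSubReddits"],
}
# Illustrative edges; the second one has its ends swapped.
edges = [
    {"_from": "authors/a1", "_to": "RawSubReddits/1"},
    {"_from": "RawSubReddits/2", "_to": "authors/a1"},
]
bad = violations(edge_definition, edges)
```

Here only the second edge is reported, because its _from is not in "authors".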

We can now browse the graph in the UI, or explore it with AQL queries. Let's pick an arbitrary first author from that list:

db._query(`for author IN authors LIMIT 1 RETURN author`).toArray()
[ 
  { 
    "_key" : "1cec812d4e44b95e5a11f3cbb15f7980", 
    "_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
    "_rev" : "_W_Eu-----_", 
    "author" : "punchyourbuns" 
  } 
]

We have identified an author, and now run a graph query for them:

db._query(`FOR vertex, edge, path IN 0..1
   OUTBOUND 'authors/1cec812d4e44b95e5a11f3cbb15f7980'
   GRAPH 'reddits'
   RETURN path`).toArray()

One of the resulting paths looks like this:

{ 
  "edges" : [ 
    { 
      "_key" : "128327199", 
      "_id" : "authorsToSubreddits/128327199", 
      "_from" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
      "_to" : "RawSubReddits/38026350", 
      "_rev" : "_W_LOxgm--F" 
    } 
  ], 
  "vertices" : [ 
    { 
      "_key" : "1cec812d4e44b95e5a11f3cbb15f7980", 
      "_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
      "_rev" : "_W_HAL-y--_", 
      "author" : "punchyourbuns" 
    }, 
    { 
      "_key" : "38026350", 
      "_id" : "RawSubReddits/38026350", 
      "_rev" : "_W-JS0na--b", 
      "distinguished" : null, 
      "created_utc" : 1484537478, 
      "id" : "dchfe6e", 
      "edited" : false, 
      "parent_id" : "t1_dch51v3", 
      "body" : "I don't understand tension at all. Mine is set to auto. I'll replace the needle and rethread. Thanks!", 
      "stickied" : false, 
      "gilded" : 0, 
      "subreddit" : "sewing", 
      "author" : "punchyourbuns", 
      "score" : 3, 
      "link_id" : "t3_5o66d0", 
      "author_flair_text" : null, 
      "author_flair_css_class" : null, 
      "controversiality" : 0, 
      "retrieved_on" : 1486085797, 
      "subreddit_id" : "t5_2sczp" 
    } 
  ] 
}

Answer 3 (score: 1)

Here is an AQL solution. It assumes, however, that all the referenced collections already exist and that UPSERT is not needed.

FOR v IN testcollection
  LET a = v.author
  LET s = v.subredit
  FILTER a
  FILTER s
  LET fid = (INSERT {author: a}   INTO authors RETURN NEW._id)[0]
  LET tid = (INSERT {subredit: s} INTO subredits RETURN NEW._id)[0]
  INSERT {_from: fid, _to: tid} INTO author_of
  RETURN [fid, tid]
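
Whether UPSERT is needed depends on whether values repeat. The Python sketch below is only a simulation with invented function names and placeholder ids (ArangoDB generates its own keys): inserting once per row creates a new author document for every row, while an insert-if-absent step, roughly what UPSERT would provide, creates one per distinct author.

```python
def plain_insert(rows):
    """One new author document per row, as a bare INSERT per row would create."""
    return [{"_id": f"authors/{i}", "author": r["author"]}
            for i, r in enumerate(rows, start=1)]

def insert_if_absent(rows):
    """One author document per distinct author."""
    seen = {}
    for r in rows:
        if r["author"] not in seen:
            seen[r["author"]] = {"_id": f"authors/{len(seen) + 1}",
                                 "author": r["author"]}
    return list(seen.values())

# Illustrative rows; author "A" appears twice.
rows = [
    {"author": "A", "subredit": "S1"},
    {"author": "A", "subredit": "S2"},
    {"author": "B", "subredit": "S1"},
]
per_row = plain_insert(rows)
distinct = insert_if_absent(rows)
```

With these rows the per-row variant yields three author documents, two of them for "A", whereas the second variant yields two.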