Question

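If you want to do anything interesting with graphs, whether with GraphX or the newer GraphFrames, you eventually end up writing recursive algorithms. The problem I'm running into is that, with DataFrames, each iteration of the algorithm takes longer and longer, and each iteration kicks off more execution stages. I'm more of a functional Spark user: I can make things happen, but I don't fully grasp what goes on under the hood. My guess is that the lineage chain keeps growing and that, unless the lineage is broken, every step recomputes the earlier iterations. So iteration 1 does iteration 1; iteration 2 redoes iteration 1 and then does iteration 2; iteration 3 has to do 1, then 2, then 3; and so on.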

So my first question: is that really what is happening, more or less?

To test it, I've been playing with RDD.checkpoint. It seems to help, but I can't prove it. That is my second question: tell me whether the way I'm using checkpoint actually helps.

Finally, it would be great to hear about other possible solutions. Maybe Spark isn't even the right answer. I'm open to anything.

To test all of this, I've been using a simple algorithm that fills in vertex attributes, a kind of attribute inheritance. Given a graph like this:

val nodes = Seq(
    (1L, Option(1L), Option(1L)),
    (2L, None, Option(2L)),
    (3L, Option(2L), None),
    (4L, None, None)
).toDF("id","inputType","recurrence")

val edges = Seq(
    (1L, 2L, "parent"),
    (2L, 4L, "parent"),
    (1L, 3L, "parent")
).toDF("src","dst","type")

Once I've filled in the missing attributes on the vertices, I should end up with this:

+---+---------+----------+
| id|inputType|recurrence|
+---+---------+----------+
|  1|        1|         1|
|  2|        1|         2|
|  3|        2|         1|
|  4|        1|         2|
+---+---------+----------+

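Vertex 1L is the root parent, and the other vertices inherit their missing attributes from their parent, walking further up the chain if necessary.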
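To make the expected result concrete, here is a minimal plain-Scala sketch of the same inheritance rule on ordinary collections, with no Spark involved. The names InheritSketch, Attrs, and fill are made up for illustration; the data mirrors the nodes and edges above, and the sketch assumes each vertex has at most one parent.

```scala
object InheritSketch {
  // Hypothetical container for the two vertex attributes; None marks a missing value.
  final case class Attrs(inputType: Option[Long], recurrence: Option[Long])

  val nodes: Map[Long, Attrs] = Map(
    1L -> Attrs(Some(1L), Some(1L)),
    2L -> Attrs(None, Some(2L)),
    3L -> Attrs(Some(2L), None),
    4L -> Attrs(None, None)
  )

  // (src, dst) pairs, where src is the parent of dst.
  val edges: Seq[(Long, Long)] = Seq((1L, 2L), (2L, 4L), (1L, 3L))

  // Repeats one round of parent-to-child filling until nothing changes.
  @annotation.tailrec
  def fill(vs: Map[Long, Attrs]): Map[Long, Attrs] = {
    val next = edges.foldLeft(vs) { case (acc, (src, dst)) =>
      val parent = vs(src) // read the parent from the previous round, like one message pass
      val child = acc(dst)
      acc.updated(dst, Attrs(
        child.inputType.orElse(parent.inputType),
        child.recurrence.orElse(parent.recurrence)))
    }
    if (next == vs) vs else fill(next)
  }
}
```

Calling InheritSketch.fill(InheritSketch.nodes) takes two filling rounds plus one round that detects no change, which matches the chain depth 1 -> 2 -> 4 in the sample graph.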

The algorithm isn't really that complicated. I'll use my own cobbled-together DataFrame/graph algorithm rather than GraphFrames.

First, I'll define a function that builds edge triplets from the nodes and edges:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
def triplets(vertices: DataFrame, edges: DataFrame) : DataFrame = {
  edges.toDF(edges.columns.map(c => "edge_" + c):_*)
    .join(vertices.toDF(vertices.columns.map(c => "src_" + c):_*), col("edge_src") === col("src_id"))
    .join(vertices.toDF(vertices.columns.map(c => "dst_" + c):_*), col("edge_dst") === col("dst_id"))
}

For the data above, showing triplets(nodes,edges) gives:

+--------+--------+---------+------+-------------+--------------+------+-------------+--------------+
|edge_src|edge_dst|edge_type|src_id|src_inputType|src_recurrence|dst_id|dst_inputType|dst_recurrence|
+--------+--------+---------+------+-------------+--------------+------+-------------+--------------+
|       1|       2|   parent|     1|            1|             1|     2|         null|             2|
|       1|       3|   parent|     1|            1|             1|     3|            2|          null|
|       2|       4|   parent|     2|         null|             2|     4|         null|          null|
+--------+--------+---------+------+-------------+--------------+------+-------------+--------------+

So far so good. Now for a recursive function that fills in the null values down the hierarchy:

def fillVertices(vertices: DataFrame, edges: DataFrame) : (DataFrame, DataFrame) = {
  val vertexAttributes = vertices.columns.filter(c => c != "id")
  val edgeAttributes = edges.columns.filter(c => (c != "src" && c != "dst"))

  val messages = triplets(vertices,edges).select(
    Seq(col("edge_src"), col("edge_dst")) ++ vertexAttributes.map(attr => when(col("src_" + attr).isNotNull && col("dst_" + attr).isNull, col("src_" + attr)) as "msg_" + attr):_*
  ).filter(
    vertexAttributes.map(attr => col("msg_" + attr).isNotNull).fold(lit(false)){ (a,b) => a || b }
  ).groupBy(col("edge_dst") as "msg_dst")
   .agg(max(col("msg_" + vertexAttributes(0))) as ("msg_" + vertexAttributes(0)), vertexAttributes.slice(1,vertexAttributes.length).map(c => max(col("msg_" + c)) as ("msg_" + c)):_*)

  if (! messages.rdd.isEmpty) {
    val newVerts = vertices.join(messages, col("id") === col("msg_dst"), "left_outer").select(Seq(col("id")) ++ vertexAttributes.map(c => coalesce(col(c), col("msg_" + c)) as c):_*)
    fillVertices(newVerts, edges)
  }
  else (vertices,edges)
}

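If you run fillVertices(nodes,edges)._1.show, it does display the correct result: every node has its null values filled in correctly. However, it takes an absurd number of stages to compute.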

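Again, note that this is very similar to the behavior I see with GraphFrames. I don't think it is specific to what I'm doing here; it looks like a general problem with recursive algorithms in Spark.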
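One way to reason about the growth, as a toy model rather than a measurement: in fillVertices the incoming vertices DataFrame is referenced three times per iteration (twice inside triplets, once in the final join), so an unbroken plan could roughly triple with every recursion. The sketch below counts nodes in a made-up Plan tree with that shape. The names LineageModel, Plan, Leaf, Join, step, and sizes are all hypothetical, and the model ignores the select/filter/groupBy steps as well as any caching or reuse Spark may apply.

```scala
object LineageModel {
  // Toy stand-in for a query plan: a Leaf is already-materialized data, a Join combines two plans.
  sealed trait Plan
  case object Leaf extends Plan
  final case class Join(left: Plan, right: Plan) extends Plan

  def size(p: Plan): Long = p match {
    case Leaf       => 1L
    case Join(l, r) => 1L + size(l) + size(r)
  }

  // One iteration shaped like fillVertices:
  // vertices.join(messages), where messages comes from edges.join(vertices).join(vertices).
  def step(vertices: Plan): Plan =
    Join(vertices, Join(Join(Leaf, vertices), vertices))

  // Plan size after each of n iterations. With checkpoint = true the result of every
  // iteration is replaced by a Leaf, which models cutting the lineage.
  def sizes(n: Int, checkpoint: Boolean): List[Long] =
    Iterator.iterate((Leaf: Plan, 0L)) { case (v, _) =>
      val next = step(v)
      (if (checkpoint) Leaf else next, size(next))
    }.drop(1).take(n).map(_._2).toList
}
```

In this model LineageModel.sizes(3, checkpoint = false) gives List(7, 25, 79), while LineageModel.sizes(3, checkpoint = true) stays at List(7, 7, 7). If the real plans behave anything like this, that would be consistent with each iteration getting slower until the lineage is cut.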

Like I said, I've tried checkpointing the underlying RDD, and it seems to help. I use this to checkpoint a DataFrame:

sc.setCheckpointDir("/your/checkpoint/dir")
def dfCheckpoint(df: DataFrame) : DataFrame = {
  df.rdd.checkpoint
  if (df.rdd.count > 0) {
    df.sqlContext.createDataFrame(df.rdd, df.schema)
  }
  else df
}

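Then, for a side-by-side test, here is the same algorithm as above, except that the newly created nodes DataFrame is checkpointed before the recursive call.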

def fillVerticesCheckpoint(vertices: DataFrame, edges: DataFrame) : (DataFrame, DataFrame) = {
  val vertexAttributes = vertices.columns.filter(c => c != "id")
  val edgeAttributes = edges.columns.filter(c => (c != "src" && c != "dst"))

  val messages = triplets(vertices, edges).select(
    Seq(col("edge_src"), col("edge_dst")) ++ vertexAttributes.map(attr => when(col("src_" + attr).isNotNull && col("dst_" + attr).isNull, col("src_" + attr)) as "msg_" + attr):_*
  ).filter(
    vertexAttributes.map(attr => col("msg_" + attr).isNotNull).fold(lit(false)){ (a,b) => a || b }
  ).groupBy(col("edge_dst") as "msg_dst")
   .agg(max(col("msg_" + vertexAttributes(0))) as ("msg_" + vertexAttributes(0)), vertexAttributes.slice(1,vertexAttributes.length).map(c => max(col("msg_" + c)) as ("msg_" + c)):_*)

  if (! messages.rdd.isEmpty) {
    val newVerts = vertices.join(messages, col("id") === col("msg_dst"), "left_outer").select(Seq(col("id")) ++ vertexAttributes.map(c => coalesce(col(c), col("msg_" + c)) as c):_*)
    fillVerticesCheckpoint(dfCheckpoint(newVerts), edges)
  }
  else (vertices, edges)
}

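Now if you run fillVerticesCheckpoint(nodes,edges)._1.show, it finishes much faster, and there appear to be far fewer stages. I don't know how to quantify it, but the checkpointed version seems to use about 1/3 as many stages as the non-checkpointed one.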
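A possible way to put a number on the stage count is to register a SparkListener and count completed stages while each variant runs. This is only a sketch: StageCounter and countStages are hypothetical names, removeSparkListener assumes a reasonably recent Spark version, and because listener events are delivered asynchronously the fixed pause before reading the counter is a crude assumption, not a guarantee.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

object StageCounter {
  // Runs `body` and returns its result together with the number of stages
  // reported as completed while it ran (skipped stages are not counted).
  def countStages[T](sc: SparkContext)(body: => T): (T, Int) = {
    val completed = new AtomicInteger(0)
    val listener = new SparkListener {
      override def onStageCompleted(event: SparkListenerStageCompleted): Unit =
        completed.incrementAndGet()
    }
    sc.addSparkListener(listener)
    try {
      val result = body
      // Assumption: one second is enough for pending listener events to arrive.
      Thread.sleep(1000)
      (result, completed.get)
    } finally {
      sc.removeSparkListener(listener)
    }
  }
}
```

Usage would look like val (_, n) = StageCounter.countStages(sc) { fillVertices(nodes, edges)._1.show }, repeated for fillVerticesCheckpoint, so the two counts can be compared directly instead of estimated from the UI.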

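Based on what I've seen, then, I'm guessing the answer to my first question is yes, this is a lineage problem. And the answer to my second question appears to be yes, checkpointing makes it better. But it would be nice to get confirmation on both.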

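As for my last point, other ways to solve the same problem, the only idea I can come up with is to create my own checkpoints by saving the DataFrame to a Parquet file between iterations. Does anyone have other ideas?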
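The Parquet idea might look roughly like the sketch below, which writes the DataFrame out and reads it straight back so that the returned DataFrame's plan starts from the files rather than from the earlier iterations. ParquetCheckpoint and parquetCheckpoint are hypothetical names, and the sketch assumes the caller supplies a fresh path for every iteration, since overwriting a location that the input DataFrame still reads from is not safe.

```scala
import org.apache.spark.sql.DataFrame

object ParquetCheckpoint {
  // Materializes `df` at `path` and returns a DataFrame backed by the written files.
  // Assumption: `path` is unique per iteration and writable by the job.
  def parquetCheckpoint(df: DataFrame, path: String): DataFrame = {
    df.write.mode("overwrite").parquet(path)
    df.sqlContext.read.parquet(path)
  }
}
```

In fillVerticesCheckpoint this would replace dfCheckpoint(newVerts), for example with a path that includes an iteration counter. The trade-off is an extra write and read per iteration, plus cleaning up the intermediate directories afterwards.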

Problems with recursive algorithms and Spark DataFrames
