Question

I'm not quite sure how Scala and Spark work, so maybe I'm writing my code the wrong way.

What I want to achieve is this: for a given Seq[(String, Int)], assign a random item from v._2.path to _._2.

To do this, I implemented a method and call it on the following line:

def getVerticesWithFeatureSeq(graph: Graph[WikiVertex, WikiEdge.Value]): RDD[(VertexId, WikiVertex)] = {
  graph.vertices.map(v => {
    //For each token in the sequence, assign an article to them based on its path(root to this node)
    println(v._1+" before "+v._2.featureSequence)
    v._2.featureSequence = v._2.featureSequence.map(f => (f._1, v._2.path.apply(new scala.util.Random().nextInt(v._2.path.size))))
    println(v._1+" after "+v._2.featureSequence)
    (v._1, v._2)
  })
}

val dt = getVerticesWithFeatureSeq(wikiGraph)

When I run this, I expect the println calls to print something, but they don't. If I add one more line of code

dt.foreach(println)

then the println calls inside map print correctly.

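Is there some kind of deferred execution in Spark code? That is, if nobody accesses a variable, is the computation postponed or even cancelled?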

Answer 1

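Is graph.vertices an RDD? That would explain your problem, because Spark transformations are lazy: nothing is computed until an action is executed, which is foreach in your case: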

val dt = getVerticesWithFeatureSeq(wikiGraph) //no result is computed yet, map transformation is 'recorded'
dt.foreach(println) //foreach action requires a result, this triggers the computation

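An RDD remembers the transformations applied to it, and they are computed only when an action requires a result, for example one that has to be returned to the driver program.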
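To see the laziness in isolation, here is a minimal sketch that is not taken from your code. It assumes Spark 2.x or later (for longAccumulator), spark-core on the classpath, and a local master; the names LazyMapDemo, run, and calls are made up for illustration. An accumulator counts how many times the function passed to map actually runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object, not part of the original question.
object LazyMapDemo {

  // Returns (map calls seen before the action, map calls seen after it, collected values).
  def run(sc: SparkContext): (Long, Long, Seq[Int]) = {
    val calls = sc.longAccumulator("map calls")

    // Transformation only: Spark records the map, nothing is executed here.
    val doubled = sc.parallelize(Seq(1, 2, 3)).map { x =>
      calls.add(1)
      x * 2
    }
    val before: Long = calls.value

    // collect() is an action, so the recorded map is executed now.
    val result = doubled.collect().toSeq
    val after: Long = calls.value

    (before, after, result)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a single-threaded local master is enough for the demo.
    val conf = new SparkConf().setAppName("lazy-map-demo").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      val (before, after, result) = run(sc)
      println(s"map calls before action: $before")
      println(s"map calls after action:  $after")
      println(s"result: ${result.mkString(", ")}")
    } finally {
      sc.stop()
    }
  }
}
```

The count taken before collect() should be zero, since no action has run at that point. Afterwards it should equal the number of elements, barring task retries, which can apply accumulator updates made inside a transformation more than once.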

See http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations for more details and a list of the available transformations and actions.

Spark's RDD.map() will not execute unless the items in the RDD are accessed
