Question

I have a couple of questions about Spark internals, specifically RDDs. Based on what's in the docs, the lineage graphs of RDDs are DAG structures.

Are they persisted anywhere? Is the 'golden copy' the one maintained by the driver program on the master node of the cluster?

I have read documentation that states that the driver program sends off tasks to the executors on worker nodes for processing.

What do these tasks look like? Do they consist of an RDD object along with information as to which partition of the data will be processed upon receipt of an action?
What happens in the case where a node goes down and the data for a partition needs to be recomputed? What is the exact sequence of steps that is executed?

Any references to code or illustrative articles detailing the process would be greatly appreciated.

Answer 1

I'll try to answer your questions as best I can.

val rdd = sc.textFile("/path/to/file")
val rdd2 = rdd.map(line => line...)
val rdd3 = rdd2.filter(x => x...)
rdd3.saveAsTextFile("/path/to/output")

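I created an RDD named "rdd". In reality, this RDD is just a set of instructions for how to access data from a data source such as HDFS. I then created "rdd2" by taking "rdd" and mapping a function over it, and did the same for "rdd3". What I am doing here is building up a lineage. Spark has two kinds of operations that can be applied to data: transformations and actions. A transformation takes an existing RDD and returns a new RDD. An action takes an RDD and returns a result. Spark does not process any data until we call an action on an RDD. An action forces Spark to process the data and return a result. There is no "golden copy."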

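In the first line of code, we tell Spark that if an action is performed on "rdd", it should read the file from HDFS. In the second line, we tell it that if an action is called on "rdd2", it should read the file from HDFS and then apply a transformation to it. We are applying a transformation to "rdd" to get to "rdd2". Doing this builds up a lineage, or a recipe for how to process the data. In the third line, we build the lineage further by saying: take what has already been done for "rdd2" and then apply a filter to it.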

saveAsTextFile is an action that tells Spark to save "rdd3" somewhere.

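When Spark processes the data, it streams it through these transformations: it reads the data from HDFS, applies a function to it, and then applies the filter. RDDs don't actually exist as data; they are just a set of instructions for how to process/access the data.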
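One way to observe this laziness is the small sketch below. It assumes Spark is on the classpath and runs in local mode, and it swaps the HDFS file for a tiny in-memory collection; the object name LineageDemo, the accumulator name, and the sample strings are made up for illustration. The accumulator counts how many records have passed through the map function, and toDebugString prints the lineage Spark has recorded for the final RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; not part of Spark itself.
object LineageDemo {
  // Builds a map -> filter chain and returns
  // (records mapped before the action, records mapped after it, lineage text).
  def run(sc: SparkContext): (Long, Long, String) = {
    val touched = sc.longAccumulator("touched") // incremented inside map
    val rdd  = sc.parallelize(Seq("a", "bb", "ccc"), numSlices = 2)
    val rdd2 = rdd.map { line => touched.add(1); line.length }
    val rdd3 = rdd2.filter(_ > 1)

    val before: Long = touched.value // no action yet, so no task has run
    rdd3.collect()                   // action: Spark now runs the job
    val after: Long = touched.value

    (before, after, rdd3.toDebugString)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lineage-demo")
    val sc = new SparkContext(conf)
    try {
      val (before, after, lineage) = run(sc)
      println(s"records mapped before the action: $before")
      println(s"records mapped after the action:  $after")
      println(lineage) // textual form of the recorded lineage
    } finally sc.stop()
  }
}
```

Before collect() is called the counter should still be 0, since only the recipe has been built up to that point.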

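These functions that we apply to the data take the form of tasks. When a task is executed, Spark passes the information required to process the data to a worker node. The only thing contained in this task is how to process the data. So, in our case, whatever is in the map and the filter will be sent to the worker nodes in the form of a task. The worker nodes take these instructions and apply them to the data. Each worker processes one block of data at a time, so if the file we specified in sc.textFile consists of 4 blocks in HDFS, there will be 4 tasks, one for each block in HDFS.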
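The relationship between partitions and tasks can be checked with a sketch like the following. It is not how Spark builds tasks internally; it only tags each element with the index of the partition that processed it, using the public mapPartitionsWithIndex and getNumPartitions methods. It assumes Spark in local mode, and PartitionDemo, the input range, and the choice of 4 slices (standing in for 4 HDFS blocks) are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; not part of Spark itself.
object PartitionDemo {
  // Returns the partition count and each result paired with the index
  // of the partition whose task produced it.
  def run(sc: SparkContext): (Int, Array[(Int, Int)]) = {
    // 4 slices play the role of the 4 HDFS blocks in the answer.
    val rdd = sc.parallelize(1 to 8, numSlices = 4)
    val tagged = rdd.mapPartitionsWithIndex { (idx, iter) =>
      // This closure is what gets shipped; it runs once per partition.
      iter.map(x => (idx, x * 10))
    }
    (rdd.getNumPartitions, tagged.collect())
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partition-demo")
    val sc = new SparkContext(conf)
    try {
      val (n, rows) = run(sc)
      println(s"partitions (and tasks in the single stage): $n")
      rows.foreach { case (idx, v) => println(s"partition $idx -> $v") }
    } finally sc.stop()
  }
}
```

With a real file, the number of partitions normally follows the input splits rather than a number you pass in, which is why the answer ties it to HDFS blocks.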

If a worker node fails while processing data, all of the tasks currently assigned to that worker will be reassigned to another worker.
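The recovery idea can be illustrated without Spark at all. What follows is a toy model in plain Scala, not Spark's scheduler code: ToyRdd, Source, Mapped, Filtered, and runJob are all invented names. Each toy RDD stores no data, only how to compute a partition from its parent, so a partition whose first attempt is lost can simply be computed again from the source.

```scala
// Toy model only; every name here is hypothetical and none of it is Spark API.
object RecomputeSketch {
  // A toy RDD holds no data, just a way to compute one partition on demand.
  sealed trait ToyRdd[A] {
    def numPartitions: Int
    def compute(partition: Int): Iterator[A]
  }

  // Stands in for a file in stable storage: its blocks can always be re-read.
  final case class Source[A](blocks: Vector[Vector[A]]) extends ToyRdd[A] {
    def numPartitions: Int = blocks.length
    def compute(partition: Int): Iterator[A] = blocks(partition).iterator
  }

  // Each output partition depends only on the same partition of the parent.
  final case class Mapped[A, B](parent: ToyRdd[A], f: A => B) extends ToyRdd[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Iterator[B] = parent.compute(partition).map(f)
  }

  final case class Filtered[A](parent: ToyRdd[A], p: A => Boolean) extends ToyRdd[A] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Iterator[A] = parent.compute(partition).filter(p)
  }

  // Runs one "task" per partition. Partitions in failFirstAttempt lose their
  // first attempt (simulated worker loss) and are computed again from lineage.
  def runJob[A](rdd: ToyRdd[A], failFirstAttempt: Set[Int]): (Vector[Vector[A]], Int) = {
    var attempts = 0
    val results = Vector.tabulate(rdd.numPartitions) { p =>
      attempts += 1
      if (failFirstAttempt.contains(p)) attempts += 1 // retry on another "worker"
      rdd.compute(p).toVector // walks the lineage back to the source
    }
    (results, attempts)
  }

  def main(args: Array[String]): Unit = {
    val lines    = Source(Vector(Vector("a", "bb"), Vector("ccc", "dddd")))
    val lengths  = Mapped(lines, (s: String) => s.length)
    val longOnes = Filtered(lengths, (n: Int) => n > 1)
    val (out, attempts) = runJob(longOnes, failFirstAttempt = Set(1))
    println(out)      // Vector(Vector(2), Vector(3, 4))
    println(attempts) // 3: two partitions plus one retry
  }
}
```

Only the lost partition is recomputed in this model; partition 0 is untouched by the failure of partition 1 because map and filter never mix data across partitions.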

Answer 2

I decided to look directly at the source code. Specifically, this line in core/src/main/scala/org/apache/spark/rdd/RDD.scala seems to answer question #2 above:

left.join(broadcast(right), "joinKey")

Answer 3

Please go through these examples that map MapReduce onto Spark; they will help you get an intuitive picture of what happens in Spark.

http://bytepadding.com/big-data/spark/understanding-spark-through-map-reduce/

http://bytepadding.com/big-data/spark/spark-code-analysis/

Questions on Apache Spark Internals - RDDs

3 answers: