I have a few questions about the following code:
val input1 = rawinput.map(_.split("\t")).map(x=>(x(6).trim(),x)).sortByKey()
val input2 = input1.map(x=> x._2.mkString("\t"))
val x0 = input2.map(_.split("\t")).map(x => (x(6),x(0)))
val x1 = input2.map(_.split("\t")).map(x => (x(6),x(1)))
val x2 = input2.map(_.split("\t")).map(x => (x(6),x(2)))
val x3 = input2.map(_.split("\t")).map(x => (x(6),x(3)))
val x4 = input2.map(_.split("\t")).map(x => (x(6),x(4)))
val x5 = input2.map(_.split("\t")).map(x => (x(6),x(5)))
val x6 = input2.map(_.split("\t")).map(x => (x(6),x(6)))
val x = x0 union x1 union x2 union x3 union x4 union x5 union x6
<pre>
**Lineage Graph:**
(7) UnionRDD[25] at union at rddCustUtil.scala:78 []
| UnionRDD[24] at union at rddCustUtil.scala:78 []
| UnionRDD[23] at union at rddCustUtil.scala:78 []
| UnionRDD[22] at union at rddCustUtil.scala:78 []
| UnionRDD[21] at union at rddCustUtil.scala:78 []
| UnionRDD[20] at union at rddCustUtil.scala:78 []
| MapPartitionsRDD[7] at map at rddCustUtil.scala:43 []
| MapPartitionsRDD[6] at map at rddCustUtil.scala:43 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
| MapPartitionsRDD[9] at map at rddCustUtil.scala:48 []
| MapPartitionsRDD[8] at map at rddCustUtil.scala:48 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
| MapPartitionsRDD[11] at map at rddCustUtil.scala:53 []
| MapPartitionsRDD[10] at map at rddCustUtil.scala:53 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
| MapPartitionsRDD[13] at map at rddCustUtil.scala:58 []
| MapPartitionsRDD[12] at map at rddCustUtil.scala:58 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
| MapPartitionsRDD[15] at map at rddCustUtil.scala:63 []
| MapPartitionsRDD[14] at map at rddCustUtil.scala:63 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
| MapPartitionsRDD[17] at map at rddCustUtil.scala:68 []
| MapPartitionsRDD[16] at map at rddCustUtil.scala:68 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
| MapPartitionsRDD[19] at map at rddCustUtil.scala:73 []
| MapPartitionsRDD[18] at map at rddCustUtil.scala:73 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
</pre>
Answer 0 (score: 2)
How many shuffle stages will be executed?
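In fact, the shuffle required to sort the data happens 7 times. Spark evaluates lazily and on demand, so unless the sorted RDD is cached, every branch of the DAG that needs it recomputes it. To fix this (and probably speed up the computation), you can cache (or, more generally, persist) input2 before using it multiple times: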
val input1 = rawinput.map(_.split("\t")).map(x=>(x(6).trim(),x)).sortByKey()
val input2 = input1.map(x=> x._2.mkString("\t")).cache()
// continue as before
Can you walk me through the DAG flow in detail?
Each x_ RDD is computed "separately", using the following computation:
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
| MapPartitionsRDD[9] at map at rddCustUtil.scala:48 []
| MapPartitionsRDD[8] at map at rddCustUtil.scala:48 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
This shows the computation that creates rawinput from textFile, followed by the sort and three map operations.
Then there are 6 union operations that combine these 7 RDDs.
Is this operation expensive?
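Yes, it appears so. As noted above, caching would make it faster, but there is a better way to achieve the same result without splitting the RDD into many separate ones: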
val x = rawinput.map(_.split("\t"))
.keyBy(_(6).trim()) // extract key
.flatMap{ case (k, arr) => arr.take(7).zipWithIndex.map((k, _)) } // flatMap into (key, (value, index))
.sortBy { case (k, (_, index)) => (index, k) } // sort by index first, key second
.map { case (k, (value, _)) => (k, value) } // remove index, it was just used for sorting
This performs a single shuffle operation and does not require persisting the data. The DAG looks like this:
(4) MapPartitionsRDD[9] at map at Test.scala:75 []
| MapPartitionsRDD[8] at sortBy at Test.scala:74 []
| ShuffledRDD[7] at sortBy at Test.scala:74 []
+-(4) MapPartitionsRDD[4] at sortBy at Test.scala:74 []
| MapPartitionsRDD[3] at flatMap at Test.scala:73 []
| MapPartitionsRDD[2] at keyBy at Test.scala:72 []
| MapPartitionsRDD[1] at map at Test.scala:71 []
| ParallelCollectionRDD[0] at parallelize at Test.scala:64 []
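To see what the single-pass version produces, the same chain of steps can be tried on a local Scala Seq instead of an RDD. This is only a sketch: the object name SinglePassSketch, the helper keyedColumns, and the two sample rows are made up for illustration, and a plain map stands in for keyBy, which is an RDD method and does not exist on Seq. No Spark is involved, so it says nothing about shuffles; it only shows the ordering of the result.

```scala
object SinglePassSketch {
  // Hypothetical sample rows: seven tab-separated fields, key in field 6.
  val rows: Seq[String] = Seq(
    "a0\ta1\ta2\ta3\ta4\ta5\tk2",
    "b0\tb1\tb2\tb3\tb4\tb5\tk1"
  )

  // Same steps as the RDD version, applied to a local collection.
  def keyedColumns(lines: Seq[String]): Seq[(String, String)] =
    lines
      .map(_.split("\t"))
      .map(arr => (arr(6).trim, arr)) // local stand-in for keyBy
      .flatMap { case (k, arr) =>
        // emit (key, (value, columnIndex)) for each of the first 7 columns
        arr.take(7).zipWithIndex.map(vi => (k, vi)).toSeq
      }
      .sortBy { case (k, (_, index)) => (index, k) } // column index first, key second
      .map { case (k, (value, _)) => (k, value) }    // drop the index again

  val result: Seq[(String, String)] = keyedColumns(rows)

  def main(args: Array[String]): Unit =
    result.foreach(println)
}
```

The output starts with all column-0 values ordered by key, then all column-1 values, and so on, which is the same layout that the union of x0 through x6 yields over the sorted input.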