Spark count vs take and length

Time: 2019-02-18 10:02:22

Tags: scala performance apache-spark apache-spark-sql query-optimization

I'm using com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 in a Zeppelin notebook and I don't understand the difference between two operations in Spark. One operation takes a lot of time to compute, while the second one executes immediately. Could someone explain the difference between these two operations to me:

import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._

case class SomeClass(val someField:String)

val timelineItems = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(scala.collection.immutable.Map(
    "spark.cassandra.connection.host" -> "127.0.0.1",
    "table" -> "timeline_items",
    "keyspace" -> "timeline"))
  .load()
//some simplified code:
val timelineRow = timelineItems
        .map(x => {SomeClass("test")})
        .filter(x => x != null)
        .toDF()
        .limit(4)

//first operation (takes a lot of time. It seems spark iterates through all items in Cassandra and doesn't use laziness with limit 4)
println(timelineRow.count()) //return: 4

//second operation (executes immediately); 300 - just random number which doesn't affect the result
println(timelineRow.take(300).length) //return: 4

1 Answer:

Answer 0 (score: 4):

What you are seeing is the difference between the implementations of Limit (a transformation-like operation) and CollectLimit (an action-like operation). However, the difference in timings is highly misleading, and not something you can expect in the general case.

First let's create an MCVE

spark.conf.set("spark.sql.files.maxPartitionBytes", 500)

val ds = spark.read
  .text("README.md")
  .as[String]
  .map{ x => {
    Thread.sleep(1000)
    x
   }}

val dsLimit4 = ds.limit(4)

and make sure that we start with a clean slate:

spark.sparkContext.statusTracker.getJobIdsForGroup(null).isEmpty
Boolean = true

Invoke count

dsLimit4.count()

and check the execution plan (from the Spark UI):

== Parsed Logical Plan ==
Aggregate [count(1) AS count#12L]
+- GlobalLimit 4
   +- LocalLimit 4
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
         +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
            +- DeserializeToObject cast(value#0 as string).toString, obj#5: java.lang.String
               +- Relation[value#0] text

== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#12L]
+- GlobalLimit 4
   +- LocalLimit 4
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
         +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
            +- DeserializeToObject cast(value#0 as string).toString, obj#5: java.lang.String
               +- Relation[value#0] text

== Optimized Logical Plan ==
Aggregate [count(1) AS count#12L]
+- GlobalLimit 4
   +- LocalLimit 4
      +- Project
         +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
            +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
               +- DeserializeToObject value#0.toString, obj#5: java.lang.String
                  +- Relation[value#0] text

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#12L])
+- *(2) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#15L])
   +- *(2) GlobalLimit 4
      +- Exchange SinglePartition
         +- *(1) LocalLimit 4
            +- *(1) Project
               +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
                  +- *(1) MapElements <function1>, obj#6: java.lang.String
                     +- *(1) DeserializeToObject value#0.toString, obj#5: java.lang.String
                        +- *(1) FileScan text [value#0] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/path/to/README.md], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>

The core component is

+- *(2) GlobalLimit 4
   +- Exchange SinglePartition
      +- *(1) LocalLimit 4

which indicates that we can expect a wide operation with multiple stages. We can see a single job

spark.sparkContext.statusTracker.getJobIdsForGroup(null)
Array[Int] = Array(0)

with two stages

spark.sparkContext.statusTracker.getJobInfo(0).get.stageIds
Array[Int] = Array(0, 1)

with eight

spark.sparkContext.statusTracker.getStageInfo(0).get.numTasks
Int = 8

and one

spark.sparkContext.statusTracker.getStageInfo(1).get.numTasks
Int = 1

task, respectively.
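
As a side note, the same plans can be printed without opening the Spark UI. Expressing the count as an aggregation DataFrame is my own reformulation here (the count() action itself returns a Long and has no explain method), but to the best of my knowledge it matches what count() builds internally:

// Prints the parsed / analyzed / optimized / physical plans of the
// count-style query, without actually executing it.
dsLimit4.groupBy().count().explain(true)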

Now, let's compare it to

dsLimit4.take(300).size

which generates the following

== Parsed Logical Plan ==
GlobalLimit 300
+- LocalLimit 300
   +- GlobalLimit 4
      +- LocalLimit 4
         +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
            +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
               +- DeserializeToObject cast(value#0 as string).toString, obj#5: java.lang.String
                  +- Relation[value#0] text

== Analyzed Logical Plan ==
value: string
GlobalLimit 300
+- LocalLimit 300
   +- GlobalLimit 4
      +- LocalLimit 4
         +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
            +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
               +- DeserializeToObject cast(value#0 as string).toString, obj#5: java.lang.String
                  +- Relation[value#0] text

== Optimized Logical Plan ==
GlobalLimit 4
+- LocalLimit 4
   +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
      +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
         +- DeserializeToObject value#0.toString, obj#5: java.lang.String
            +- Relation[value#0] text

== Physical Plan ==
CollectLimit 4
+- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
   +- *(1) MapElements <function1>, obj#6: java.lang.String
      +- *(1) DeserializeToObject value#0.toString, obj#5: java.lang.String
         +- *(1) FileScan text [value#0] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/path/to/README.md], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>

While global and local limits are still present, there is no exchange in the middle. Therefore we can expect a single-stage operation. Please note that the planner narrowed down the limit to the more restrictive value.
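
Incidentally, you don't have to run the action to see that narrowing; chaining an explicit limit builds the same nested-limit plan that take(300) builds internally, and explain shows it being collapsed (a small sketch that only prints plans and executes nothing):

// The parsed plan contains GlobalLimit 300 over GlobalLimit 4, while the
// optimized plan should keep only the more restrictive value (4) and the
// physical plan a single CollectLimit.
dsLimit4.limit(300).explain(true)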

As expected, we see a single new job:

spark.sparkContext.statusTracker.getJobIdsForGroup(null)
Array[Int] = Array(1, 0)

which generated only one stage:

spark.sparkContext.statusTracker.getJobInfo(1).get.stageIds
Array[Int] = Array(2)

with only one task

spark.sparkContext.statusTracker.getStageInfo(2).get.numTasks
Int = 1

What does this mean for us?

  • In the count case Spark used a wide transformation and actually applied the LocalLimit on each partition and shuffled the partial results to perform the GlobalLimit.
  • In the take case Spark used a narrow transformation and evaluated the LocalLimit only on the first partition.

Obviously the latter approach won't work when the number of values in the first partition is lower than the requested limit.
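
Before relying on that short circuit it is worth knowing how the upstream is partitioned; a quick check (my addition, not part of the original answer) that does not submit any job:

// Number of input partitions of the scan; with the maxPartitionBytes
// setting from the MCVE above this should report 8.
ds.rdd.getNumPartitions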

val dsLimit105 = ds.limit(105) // There are 105 lines

In that case the first count will use exactly the same logic as before (I encourage you to confirm that empirically), but take will follow a rather different path. So far we have triggered only two jobs:

spark.sparkContext.statusTracker.getJobIdsForGroup(null)
Array[Int] = Array(1, 0)

Now if we execute

dsLimit105.take(300).size

you'll see that it required 3 more jobs:

spark.sparkContext.statusTracker.getJobIdsForGroup(null)
Array[Int] = Array(4, 3, 2, 1, 0)

So what's going on here? As noted before, evaluating a single partition is not enough to satisfy the limit in the general case. In such a case Spark iteratively evaluates the LocalLimit on partitions until the GlobalLimit is satisfied, increasing the number of partitions taken in each iteration.
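
If memory serves, the growth rate of that iteration is governed by an internal setting, spark.sql.limit.scaleUpFactor (default 4); treat the exact name and default as something to verify against your Spark version:

// Assumption: a larger factor makes each retry scan more partitions,
// so fewer follow-up jobs at the cost of possibly reading more data.
spark.conf.set("spark.sql.limit.scaleUpFactor", 8)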

Such a strategy can have significant performance implications. Starting Spark jobs alone is not cheap, and in cases where the upstream object is the result of a wide transformation things can get quite ugly (in the best case scenario you can read shuffle files, but if these are lost for some reason, Spark might be forced to re-execute all the dependencies).
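
If you suspect this is happening in your own pipeline, a cheap way to spot it (my addition, reusing the statusTracker calls from above) is to compare the job count around the call:

// Counts how many jobs a single take() triggers; several jobs for one
// call is the signature of the iterative partition scan described above.
val before = spark.sparkContext.statusTracker.getJobIdsForGroup(null).length
dsLimit105.take(300)
val after = spark.sparkContext.statusTracker.getJobIdsForGroup(null).length
println(s"take triggered ${after - before} job(s)")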

To summarize:

  • take is an action, and can short-circuit in specific cases where the upstream process is narrow and the LocalLimits can satisfy the GlobalLimit using the first few partitions.
  • limit is a transformation, and always evaluates all LocalLimits, as there is no iterative escape hatch.

While one can behave better than the other in specific cases, they are not interchangeable, and neither guarantees better performance in general.
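
To make the transformation/action distinction from the summary concrete (my own illustration, not part of the original answer):

// limit is lazy and stays distributed: nothing runs until an action is
// called, and the result can be used in further transformations.
val stillDistributed: org.apache.spark.sql.Dataset[String] = ds.limit(4)

// take runs jobs immediately and brings the rows to the driver as a
// local array.
val onDriver: Array[String] = ds.take(4)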