I don't think I understand how select or drop works. I am exploding a dataset and do not want some of the columns to be copied into the newly generated rows.
val ds = spark.sparkContext.parallelize(Seq(
("2017-01-01 06:15:00", "ASC_a", "1"),
("2017-01-01 06:19:00", "start", "2"),
("2017-01-01 06:22:00", "ASC_b", "2"),
("2017-01-01 06:30:00", "end", "2"),
("2017-01-01 10:45:00", "ASC_a", "3"),
("2017-01-01 10:50:00", "start", "3"),
("2017-01-01 11:22:00", "ASC_c", "4"),
("2017-01-01 11:31:00", "end", "5" )
)).toDF("timestamp", "status", "msg")
ds.show()
val foo = ds.select($"timestamp", $"msg")
val bar = ds.drop($"status")
foo.printSchema()
bar.printSchema()
println("foo " + foo.where($"status" === "end").count)
println("bar " + bar.where($"status" === "end").count)
Output:
root
 |-- timestamp: string (nullable = true)
 |-- msg: string (nullable = true)

root
 |-- timestamp: string (nullable = true)
 |-- msg: string (nullable = true)
foo 2
bar 2
Why do I still get 2 for both outputs, when
a) status was never selected, and
b) status was dropped?
Edit:
println("foo " + foo.where(foo.col("status") === "end").count)
says there is no column status. Shouldn't that behave the same as println("foo " + foo.where($"status" === "end").count)?
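A side note on the Edit (a sketch, assuming Spark 2.x behavior): $"status" builds an unresolved column that is only resolved during analysis of the full query plan, where the analyzer can still reach status through foo's child plan (ds). foo.col("status"), by contrast, is resolved eagerly against foo's own schema (timestamp, msg) and therefore fails immediately:

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

object ResolutionDemo extends App {
  val spark = SparkSession.builder().master("local[1]").appName("resolution-demo").getOrCreate()
  import spark.implicits._

  val ds = Seq(("2017-01-01 06:30:00", "end", "2")).toDF("timestamp", "status", "msg")
  val foo = ds.select($"timestamp", $"msg")

  // Lazy resolution: $"status" is resolved against the whole plan during
  // analysis, so the filter silently reaches back into ds and succeeds.
  println(foo.where($"status" === "end").count)

  // Eager resolution: foo.col("status") looks only at foo's schema and throws.
  try {
    foo.col("status")
  } catch {
    case e: AnalysisException => println("foo.col: " + e.getMessage)
  }

  spark.stop()
}
```

So the two expressions are built the same way but resolved at different times, which is why only the second one reports a missing column.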
Answer 0 (score: 3)
Why do I still get 2 for both outputs?

Because the optimizer is free to reorganize the execution plan. In fact, if you inspect it:
== Physical Plan ==
*Project [_1#4 AS timestamp#8, _3#6 AS msg#10]
+- *Filter (isnotnull(_2#5) && (_2#5 = end))
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._1, true) AS _1#4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._2, true) AS _2#5, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._3, true) AS _3#6]
+- Scan ExternalRDDScan[obj#3]
you will see that the filter is pushed down early and executed before the projection. So it is equivalent to:
SELECT _1 AS timestamp, _3 AS msg
FROM ds WHERE _2 IS NOT NULL AND _2 = 'end'
Arguably this is a minor bug, and the code should instead be translated as

SELECT * FROM (
    SELECT _1 AS timestamp, _3 AS msg FROM ds
) WHERE _2 IS NOT NULL AND _2 = 'end'

and throw an exception, since _2 is no longer in scope for the outer filter.
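If you actually want that exception today, one workaround (a sketch, not from the original answer) is to re-create the DataFrame from its RDD and schema. spark.createDataFrame truncates the logical-plan lineage, so the analyzer can no longer resolve status through the original child plan:

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

object LineageDemo extends App {
  val spark = SparkSession.builder().master("local[1]").appName("lineage-demo").getOrCreate()
  import spark.implicits._

  val ds = Seq(("2017-01-01 06:30:00", "end", "2")).toDF("timestamp", "status", "msg")
  val foo = ds.select($"timestamp", $"msg")

  // Detach foo from its lineage: the new DataFrame only knows (timestamp, msg).
  val detached = spark.createDataFrame(foo.rdd, foo.schema)

  // Now the filter fails at analysis time, as one would expect:
  try {
    detached.where($"status" === "end").count()
  } catch {
    case e: AnalysisException => println("detached: " + e.getMessage)
  }

  spark.stop()
}
```

The cost of this trick is that it forces the RDD conversion and discards any optimizations across the boundary, so it is best kept to cases where the silent column resurrection is genuinely unwanted.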