使用SQLTransformers
我们可以在数据框架中创建新列,并且Pipeline
也可以SQLTransformers
。我们也可以在数据帧上使用多个selectExpr
方法调用来做同样的事情。
但是应用于同时应用于SQLTransformers
管道的selectExpr调用的性能优化指标是什么?
例如,考虑以下两段代码:
#Method 1
df = spark.table("transactions")
df = df.selectExpr("*","sum(amt) over (partition by account) as acc_sum")
df = df.selectExpr("*","sum(amt) over (partition by dt) as dt_sum")
df.show(10)
#Method 2
df = spark.table("transactions")
trans1 = SQLTransformer(statement ="SELECT *,sum(amt) over (partition by account) as acc_sum from __THIS__")
trans2 = SQLTransformer(statement ="SELECT *,sum(amt) over (partition by dt) as dt_sum from __THIS__")
pipe = Pipeline(stage[trans1,trans2])
transPipe = pipe.fit(df)
transPipe.transform(df).show(10)
这两种计算方法的性能是否相同?
或者是否会对方法1应用一些未在方法2中使用的额外优化?
答案 0 :(得分:2)
无需额外优化。一如既往,如有疑问,请检查执行计划:
df = spark.createDataFrame([(1, 1, 1)], ("amt", "account", "dt"))
(df
.selectExpr("*","sum(amt) over (partition by account) as acc_sum")
.selectExpr("*","sum(amt) over (partition by dt) as dt_sum")
.explain(True))
产生
== Parsed Logical Plan ==
'Project [*, 'sum('amt) windowspecdefinition('dt, unspecifiedframe$()) AS dt_sum#165]
+- AnalysisBarrier Project [amt#22L, account#23L, dt#24L, acc_sum#158L]
== Analyzed Logical Plan ==
amt: bigint, account: bigint, dt: bigint, acc_sum: bigint, dt_sum: bigint
Project [amt#22L, account#23L, dt#24L, acc_sum#158L, dt_sum#165L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#158L, dt_sum#165L, dt_sum#165L]
+- Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#165L], [dt#24L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#158L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#158L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#158L, acc_sum#158L]
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#158L], [account#23L]
+- Project [amt#22L, account#23L, dt#24L]
+- LogicalRDD [amt#22L, account#23L, dt#24L], false
== Optimized Logical Plan ==
Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#165L], [dt#24L]
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#158L], [account#23L]
+- LogicalRDD [amt#22L, account#23L, dt#24L], false
== Physical Plan ==
Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#165L], [dt#24L]
+- *Sort [dt#24L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(dt#24L, 200)
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#158L], [account#23L]
+- *Sort [account#23L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(account#23L, 200)
+- Scan ExistingRDD[amt#22L,account#23L,dt#24L]
,而
trans2.transform(trans1.transform(df)).explain(True)
产生
== Parsed Logical Plan ==
'Project [*, 'sum('amt) windowspecdefinition('dt, unspecifiedframe$()) AS dt_sum#150]
+- 'UnresolvedRelation `SQLTransformer_4318bd7007cefbf17a97_826abb6c003c`
== Analyzed Logical Plan ==
amt: bigint, account: bigint, dt: bigint, acc_sum: bigint, dt_sum: bigint
Project [amt#22L, account#23L, dt#24L, acc_sum#120L, dt_sum#150L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#120L, dt_sum#150L, dt_sum#150L]
+- Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#150L], [dt#24L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#120L]
+- SubqueryAlias sqltransformer_4318bd7007cefbf17a97_826abb6c003c
+- Project [amt#22L, account#23L, dt#24L, acc_sum#120L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#120L, acc_sum#120L]
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#120L], [account#23L]
+- Project [amt#22L, account#23L, dt#24L]
+- SubqueryAlias sqltransformer_4688bba599a7f5a09c39_f5e9d251099e
+- LogicalRDD [amt#22L, account#23L, dt#24L], false
== Optimized Logical Plan ==
Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#150L], [dt#24L]
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#120L], [account#23L]
+- LogicalRDD [amt#22L, account#23L, dt#24L], false
== Physical Plan ==
Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#150L], [dt#24L]
+- *Sort [dt#24L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(dt#24L, 200)
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#120L], [account#23L]
+- *Sort [account#23L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(account#23L, 200)
+- Scan ExistingRDD[amt#22L,account#23L,dt#24L]
正如您所看到的,优化和物理计划完全相同。