考虑以下pyspark代码
def transformed_data(spark):
df = spark.read.json('data.json')
df = expensive_transformation(df) # (A)
return df
df1 = transformed_data(spark)
df = transformed_data(spark)
df1 = foo_transform(df1)
df = bar_transform(df)
return df.join(df1)
我的问题是:在transformed_data
中对final_view
上定义为(A)的操作是否进行了优化,因此只能执行一次?
请注意,此代码不等同于
df1 = transformed_data(spark)
df = df1
df1 = foo_transform(df1)
df = bar_transform(df)
df.join(df1)
(至少从Python的角度来看,在这种情况下为id(df1) = id(df)
。
更广泛的问题是:优化两个相等的DAG时,火花会考虑什么:DAG(由其边缘和节点定义)是否相等,或者它们的对象ID(df = df1
)是否相等? >
答案 0 :(得分:4)
金田。它依赖于Spark具有足够的信息来推断依赖关系。
例如,我按照说明复制了您的示例:
from pyspark.sql.functions import hash
def f(spark, filename):
df=spark.read.csv(filename)
df2=df.select(hash('_c1').alias('hashc2'))
df3=df2.select(hash('hashc2').alias('hashc3'))
df4=df3.select(hash('hashc3').alias('hashc4'))
return df4
filename = 'some-valid-file.csv'
df_a = f(spark, filename)
df_b = f(spark, filename)
assert df_a != df_b
df_joined = df_a.join(df_b, df_a.hashc4==df_b.hashc4, how='left')
如果我使用df_joined.explain(extended=True)
解释此结果数据帧,则会看到以下四个计划:
== Parsed Logical Plan ==
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hashc3#18, 42) AS hashc4#20]
: +- Project [hash(hashc2#16, 42) AS hashc3#18]
: +- Project [hash(_c1#11, 42) AS hashc2#16]
: +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hashc3#40, 42) AS hashc4#42]
+- Project [hash(hashc2#38, 42) AS hashc3#40]
+- Project [hash(_c1#33, 42) AS hashc2#38]
+- Relation[_c0#32,_c1#33,_c2#34] csv
== Analyzed Logical Plan ==
hashc4: int, hashc4: int
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hashc3#18, 42) AS hashc4#20]
: +- Project [hash(hashc2#16, 42) AS hashc3#18]
: +- Project [hash(_c1#11, 42) AS hashc2#16]
: +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hashc3#40, 42) AS hashc4#42]
+- Project [hash(hashc2#38, 42) AS hashc3#40]
+- Project [hash(_c1#33, 42) AS hashc2#38]
+- Relation[_c0#32,_c1#33,_c2#34] csv
== Optimized Logical Plan ==
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hash(hash(_c1#11, 42), 42), 42) AS hashc4#20]
: +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hash(hash(_c1#33, 42), 42), 42) AS hashc4#42]
+- Relation[_c0#32,_c1#33,_c2#34] csv
== Physical Plan ==
SortMergeJoin [hashc4#20], [hashc4#42], LeftOuter
:- *(2) Sort [hashc4#20 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(hashc4#20, 200)
: +- *(1) Project [hash(hash(hash(_c1#11, 42), 42), 42) AS hashc4#20]
: +- *(1) FileScan csv [_c1#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file: some-valid-file.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c1:string>
+- *(4) Sort [hashc4#42 ASC NULLS FIRST], false, 0
+- ReusedExchange [hashc4#42], Exchange hashpartitioning(hashc4#20, 200)
上面的physical plan仅读取CSV一次并重新使用所有计算,因为Spark检测到两个FileScan
是相同的(即Spark知道它们不是独立的)。
现在考虑是否要用手工制作的独立但相同的RDD替换read.csv
。
from pyspark.sql.functions import hash
def g(spark):
df=spark.createDataFrame([('a', 'a'), ('b', 'b'), ('c', 'c')], ["_c1", "_c2"])
df2=df.select(hash('_c1').alias('hashc2'))
df3=df2.select(hash('hashc2').alias('hashc3'))
df4=df3.select(hash('hashc3').alias('hashc4'))
return df4
df_c = g(spark)
df_d = g(spark)
df_joined = df_c.join(df_d, df_c.hashc4==df_d.hashc4, how='left')
在这种情况下,Spark的物理计划扫描两个不同的RDD。这是运行df_joined.explain(extended=True)
进行确认的输出。
== Parsed Logical Plan ==
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hashc3#6, 42) AS hashc4#8]
: +- Project [hash(hashc2#4, 42) AS hashc3#6]
: +- Project [hash(_c1#0, 42) AS hashc2#4]
: +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hashc3#16, 42) AS hashc4#18]
+- Project [hash(hashc2#14, 42) AS hashc3#16]
+- Project [hash(_c1#10, 42) AS hashc2#14]
+- LogicalRDD [_c1#10, _c2#11], false
== Analyzed Logical Plan ==
hashc4: int, hashc4: int
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hashc3#6, 42) AS hashc4#8]
: +- Project [hash(hashc2#4, 42) AS hashc3#6]
: +- Project [hash(_c1#0, 42) AS hashc2#4]
: +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hashc3#16, 42) AS hashc4#18]
+- Project [hash(hashc2#14, 42) AS hashc3#16]
+- Project [hash(_c1#10, 42) AS hashc2#14]
+- LogicalRDD [_c1#10, _c2#11], false
== Optimized Logical Plan ==
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hash(hash(_c1#0, 42), 42), 42) AS hashc4#8]
: +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hash(hash(_c1#10, 42), 42), 42) AS hashc4#18]
+- LogicalRDD [_c1#10, _c2#11], false
== Physical Plan ==
SortMergeJoin [hashc4#8], [hashc4#18], LeftOuter
:- *(2) Sort [hashc4#8 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(hashc4#8, 200)
: +- *(1) Project [hash(hash(hash(_c1#0, 42), 42), 42) AS hashc4#8]
: +- Scan ExistingRDD[_c1#0,_c2#1]
+- *(4) Sort [hashc4#18 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(hashc4#18, 200)
+- *(3) Project [hash(hash(hash(_c1#10, 42), 42), 42) AS hashc4#18]
+- Scan ExistingRDD[_c1#10,_c2#11]
这并不是PySpark特有的行为。