Why does Spark think this is a cross/Cartesian join

Date: 2017-02-27 02:51:20

Tags: apache-spark dataframe pyspark apache-spark-sql

I want to join the data twice, like this:

rdd1 = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['idx', 'val'])
rdd2 = spark.createDataFrame([(1, 2, 1), (1, 3, 0), (2, 3, 1)], ['key1', 'key2', 'val'])

res1 = rdd1.join(rdd2, on=[rdd1['idx'] == rdd2['key1']])
res2 = res1.join(rdd1, on=[res1['key2'] == rdd1['idx']])
res2.show()

Then I get this error:

pyspark.sql.utils.AnalysisException: u'Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true;'

But I don't think this is a cross join.
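(For reference, the switch the error message points at can be flipped per session; a minimal sketch, assuming the same Spark 2.x SparkSession bound to spark as in the code above. Note that this only lifts the safety check, it does not change the plan.)

# allow Cartesian products for this session, as the error message suggests
spark.conf.set("spark.sql.crossJoin.enabled", "true")

# or when building the session:
# spark = SparkSession.builder \
#     .config("spark.sql.crossJoin.enabled", "true") \
#     .getOrCreate()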

Update:

res2.explain()

== Physical Plan ==
CartesianProduct
:- *SortMergeJoin [idx#0L, idx#0L], [key1#5L, key2#6L], Inner
:  :- *Sort [idx#0L ASC, idx#0L ASC], false, 0
:  :  +- Exchange hashpartitioning(idx#0L, idx#0L, 200)
:  :     +- *Filter isnotnull(idx#0L)
:  :        +- Scan ExistingRDD[idx#0L,val#1]
:  +- *Sort [key1#5L ASC, key2#6L ASC], false, 0
:     +- Exchange hashpartitioning(key1#5L, key2#6L, 200)
:        +- *Filter ((isnotnull(key2#6L) && (key2#6L = key1#5L)) && isnotnull(key1#5L))
:           +- Scan ExistingRDD[key1#5L,key2#6L,val#7L]
+- Scan ExistingRDD[idx#40L,val#41]

3 answers:

Answer 0: (score: 9)

This happens because you join structures sharing the same lineage, and this leads to a trivially equal condition:

res2.explain()

== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Join Inner, ((idx#204L = key1#209L) && (key2#210L = idx#204L))
:- Filter isnotnull(idx#204L)
:  +- LogicalRDD [idx#204L, val#205]
+- Filter ((isnotnull(key2#210L) && (key2#210L = key1#209L)) && isnotnull(key1#209L))
   +- LogicalRDD [key1#209L, key2#210L, val#211L]
and
LogicalRDD [idx#235L, val#236]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;

In a case like this you should use aliases:

from pyspark.sql.functions import col

rdd1 = spark.createDataFrame(...).alias('rdd1')
rdd2 = spark.createDataFrame(...).alias('rdd2')

res1 = rdd1.join(rdd2, col('rdd1.idx') == col('rdd2.key1')).alias('res1')
res1.join(rdd1, on=col('res1.key2') == col('rdd1.idx')).explain()

For details, see SPARK-6459.
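For completeness, here is the same recipe with the question's data filled in; a sketch assuming the same spark session as above, with the trailing show() added purely for illustration:

from pyspark.sql.functions import col

# the question's DataFrames, each given an explicit alias
rdd1 = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['idx', 'val']).alias('rdd1')
rdd2 = spark.createDataFrame([(1, 2, 1), (1, 3, 0), (2, 3, 1)], ['key1', 'key2', 'val']).alias('rdd2')

# alias the intermediate result too, so both join conditions refer to
# distinct relation names rather than the shared lineage
res1 = rdd1.join(rdd2, col('rdd1.idx') == col('rdd2.key1')).alias('res1')
res2 = res1.join(rdd1, on=col('res1.key2') == col('rdd1.idx'))

res2.explain()
res2.show()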

Answer 1: (score: 3)

I was also successful when I persisted the dataframe before the second join.

Something like:

from pyspark.sql.functions import col

# rdd1/rdd2 are assumed to still carry the aliases from the answer above
res1 = rdd1.join(rdd2, col('rdd1.idx') == col('rdd2.key1')).alias('res1').persist()

res1.join(rdd1, on=col('res1.key2') == col('rdd1.idx'))

Answer 2: (score: 0)

Persisting did not work for me.

I got around it by using aliases on the DataFrames:

from pyspark.sql.functions import col

df1.alias("buildings").join(df2.alias("managers"), col("managers.distinguishedName") == col("buildings.manager"))
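A small follow-up on this approach: once both sides carry aliases, individual columns can also be referenced through them after the join, which keeps references unambiguous even when column names overlap. A sketch reusing the answer's df1/df2 and its column names:

from pyspark.sql.functions import col

joined = df1.alias("buildings").join(
    df2.alias("managers"),
    col("managers.distinguishedName") == col("buildings.manager")
)

# refer to columns through the aliases to avoid any ambiguity
joined.select(col("buildings.manager"), col("managers.distinguishedName")).show()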