Question

Suppose I have the following two DataFrames:

  df              df_type
+---+---+       +---+-------+
|  s|  o|       |  e| e_type|
+---+---+       +---+-------+
| s1| o1|       | s1|s1_type|
| s1| 10|       | o1|o1_type|
| s1| o3|       | s1|s1_type|
| s2| o1|       | s1|s1_type|
| s2| o2|       | o3|o3_type|
+---+---+       | s2|s2_type|
                | o1|o1_type|
                | s2|s2_type|
                | o2|o2_type|
                +---+-------+

The goal is to look up the type of each column value of df in the df_type table, as shown below:

+---+---+-------+-------+
|  s|  o| s_type| o_type|
+---+---+-------+-------+
| s1| o1|s1_type|o1_type|
| s1| 10|s1_type|   null|
| s1| o3|s1_type|o3_type|
| s2| o1|s2_type|o1_type|
| s2| o2|s2_type|o2_type|
+---+---+-------+-------+
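To pin down the intended semantics, here is a minimal plain-Python sketch of the lookup, independent of Spark. The names `df_rows`, `df_type_rows`, and `lookup_types` are hypothetical and exist only for this illustration. It also assumes that duplicate rows in df_type always carry the same type, so they can be collapsed into a single mapping.

```python
# Sample rows copied from the df and df_type tables above.
df_rows = [("s1", "o1"), ("s1", "10"), ("s1", "o3"), ("s2", "o1"), ("s2", "o2")]
df_type_rows = [
    ("s1", "s1_type"), ("o1", "o1_type"), ("s1", "s1_type"),
    ("s1", "s1_type"), ("o3", "o3_type"), ("s2", "s2_type"),
    ("o1", "o1_type"), ("s2", "s2_type"), ("o2", "o2_type"),
]

# Assumption: duplicate entities in df_type agree on their type,
# so collapsing them into a dict loses no information.
type_of = dict(df_type_rows)


def lookup_types(rows, type_of):
    """Attach s_type and o_type to each (s, o) row; None when no type is known."""
    return [(s, o, type_of.get(s), type_of.get(o)) for s, o in rows]


result = lookup_types(df_rows, type_of)
for row in result:
    print(row)
```

The row ("s1", "10") gets None for o_type because "10" never appears in df_type, which corresponds to the null in the table above.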

For efficiency, and to avoid joining the two DataFrames more than once, I used the following query

from pyspark.sql.functions import col, collect_list, explode, when

df = df.join(df_type, (col('e') == col('s')) | (col('e') == col('o')), 'left')
df = (df.groupBy(['s', 'o'])
        .agg(collect_list(when(col('e') == col('s'), col('e_type'))).alias('s_type'),
             collect_list(when(col('e') == col('o'), col('e_type'))).alias('o_type'))
        .withColumn('s_type', explode('s_type'))
        .withColumn('o_type', explode('o_type')))

instead of the old query

df = df.join(df_type, col('e') == col('s'),'left' ).drop('e').withColumnRenamed('e_type','s_type')
df = df.join(df_type, col('e') == col('o'), 'left' ).drop('e').withColumnRenamed('e_type','o_type')

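The new approach works fine on smaller datasets. On the real data, however, the old approach (two separate joins) works, while df.join(df_type, (col('e') == col('s')) | (col('e') == col('o')),'left') crashes with a "GC overhead limit exceeded" error.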

I have tried every suggestion I could find on Stack Overflow, from disabling autoBroadcastJoinThreshold to increasing the number of partitions and spark.driver.memory*, ..., but with no luck.

My question is: why does the df.join(df_type, (col('e') == col('s')) | (col('e') == col('o')),'left') join crash while the old approach with two joins works fine?

-----------------------------Edit---------------
Here is some additional information about the new query

df.join(df_type, (col('e') == col('s')) | (col('e') == col('o')),'left').explain()
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftOuter, ((e#271 = s#4) || (e#271 = o#5))
:- *(1) Project [_1#0 AS s#4, _2#1 AS o#5]
:  +- Scan ExistingRDD[_1#0,_2#1]
+- BroadcastExchange IdentityBroadcastMode
   +- *(2) Project [_1#267 AS e#271, _2#268 AS e_type#272]
      +- Scan ExistingRDD[_1#267,_2#268]

versus the old way of joining

 df.join(df_type, (col('e') == col('s')),'left').explain()
== Physical Plan ==
SortMergeJoin [s#4], [e#271], LeftOuter
:- *(2) Sort [s#4 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(s#4, 200)
:     +- *(1) Project [_1#0 AS s#4, _2#1 AS o#5]
:        +- Scan ExistingRDD[_1#0,_2#1]
+- *(4) Sort [e#271 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(e#271, 200)
      +- *(3) Project [_1#267 AS e#271, _2#268 AS e_type#272]
         +- *(3) Filter isnotnull(_1#267)
            +- Scan ExistingRDD[_1#267,_2#268]

Clearly the two plans are very different, so what should I do to make df.join(df_type, (col('e') == col('s')) | (col('e') == col('o')),'left') work?
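The difference between the two plans can be illustrated outside Spark. A join condition built from an OR of two equalities has no single join key, so Spark cannot plan it as an equi-join (the SortMergeJoin above) and falls back to BroadcastNestedLoopJoin, which tests every row of df against every row of df_type. The plain-Python sketch below only counts comparisons. It is not how Spark is implemented, the function names are hypothetical, and the keyed lookup merely stands in for an equi-join. It also assumes df_type has been deduplicated.

```python
# Sample data from the question; df_type deduplicated (an assumption of this sketch).
left = [("s1", "o1"), ("s1", "10"), ("s1", "o3"), ("s2", "o1"), ("s2", "o2")]
right = [("s1", "s1_type"), ("o1", "o1_type"), ("o3", "o3_type"),
         ("s2", "s2_type"), ("o2", "o2_type")]


def nested_loop_left_join(left, right):
    """Left join on (e == s) OR (e == o): each left row is tested against each right row."""
    comparisons = 0
    out = []
    for s, o in left:
        matched = False
        for e, e_type in right:
            comparisons += 1
            if e == s or e == o:
                out.append((s, o, e, e_type))
                matched = True
        if not matched:
            out.append((s, o, None, None))  # left-outer: keep unmatched rows
    return out, comparisons


def two_keyed_lookups(left, right):
    """Stand-in for two equi-joins: build one index on e, then probe it once per column."""
    index = {e: e_type for e, e_type in right}
    probes = 0
    out = []
    for s, o in left:
        probes += 2  # one probe for s, one for o
        out.append((s, o, index.get(s), index.get(o)))
    return out, probes


_, comparisons = nested_loop_left_join(left, right)
_, probes = two_keyed_lookups(left, right)
print(comparisons, probes)
```

The nested loop does len(left) * len(right) comparisons, while the keyed version does 2 * len(left) probes. On large inputs that quadratic growth, together with broadcasting all of df_type to every executor, may be what exhausts memory, though the sketch cannot confirm that for the real job.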

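I am starting to wonder whether this new approach is really more efficient than two consecutive joins. Maybe the old approach is actually better?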

The "GC overhead limit exceeded" error appears only when I perform the single join on multiple columns.

0 answers: