DataFrame left outer join not working as expected in Spark

Asked: 2018-03-22 16:15:04

Tags: apache-spark apache-spark-sql left-join

I have two DataFrames with the following schemas:

clusterDF schema
root
 |-- cluster_id: string (nullable = true)

df schema
root
 |-- cluster_id: string (nullable = true)
 |-- name: string (nullable = true)
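
For reproducibility, here is a minimal sketch of how two DataFrames with these schemas can be built (the sample rows are made up for illustration; my real data comes from JSON):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-example").getOrCreate()
import spark.implicits._

// Made-up sample rows that merely reproduce the two schemas above
val clusterDF = Seq("c1", "c2", "c3").toDF("cluster_id")
val df = Seq(("c1", "kroger"), ("c2", "acme")).toDF("cluster_id", "name")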

I am trying to join them with:
val nameDF  = clusterDF.join(df, col("clusterDF.cluster_id") === col("df.cluster_id"), "left" )

But the code above fails with:

org.apache.spark.sql.AnalysisException: cannot resolve '`clusterDF.cluster_id`' given input columns: [cluster_id, cluster_id, name];;
'Join LeftOuter, ('clusterDF.cluster_id = 'df.cluster_id)
:- Aggregate [cluster_id#0], [cluster_id#0]
:  +- Project [cluster_id#0]
:     +- Filter (name#18 = kroger)
:        +- Project [cluster_id#0, name#18]
:           +- Generate explode(influencers#1.screenName), true, false, [name#18]
:              +- Relation[cluster_id#0,influencers#1] json
+- Project [cluster_id#26, name#18]
   +- Generate explode(influencers#27.screenName), true, false, [name#18]
      +- Relation[cluster_id#26,influencers#27] json

This seems strange to me. Any suggestions would be appreciated.

1 Answer:

Answer 0 (score: 2)

The error message is clear enough:

org.apache.spark.sql.AnalysisException: cannot resolve '`clusterDF.cluster_id`' given input columns: [cluster_id, cluster_id, name];;

It means the column reference cannot be resolved: Spark's analyzer only sees column names and explicit aliases created with .as(...)/.alias(...); the Scala variable names clusterDF and df mean nothing to it, so col("clusterDF.cluster_id") matches no column. Use one of the following approaches instead:

val nameDF  = clusterDF.join(df, clusterDF("cluster_id") === df("cluster_id"), "left")
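
Here each column is resolved against its own DataFrame object, so Spark can tell the two identically named cluster_id columns apart without any aliases.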

import org.apache.spark.sql.functions._
val nameDF  = clusterDF.as("table1").join(df.as("table2"), col("table1.cluster_id") === col("table2.cluster_id"), "left")

import spark.implicits._
val nameDF  = clusterDF.as("table1").join(df.as("table2"), $"table1.cluster_id" === $"table2.cluster_id", "left")
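
The $ interpolator comes from spark.implicits._ and builds a Column just like col(...), so the last two variants are equivalent; pick whichever reads better.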

Or join on the shared column name directly; this form also avoids the duplicated cluster_id column in the result:

val nameDF  = clusterDF.join(df, Seq("cluster_id"), "left")
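
Note that the first three variants keep both cluster_id columns in the output. If you only need one copy, you can drop the right-hand one after the join; a minimal sketch, assuming nameDF came from the first variant above:

// Drop df's copy of the join key; clusterDF's cluster_id and name remain
val deduped = nameDF.drop(df("cluster_id"))
deduped.printSchema()
// root
//  |-- cluster_id: string (nullable = true)
//  |-- name: string (nullable = true)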