Spark 1.6.0 DataFrame self-join problem

Time: 2016-10-12 04:25:35

Tags: scala apache-spark spark-dataframe

I am trying to perform a self-join using the DataFrame Scala API. Here is my code snippet; can you tell me what is wrong with the first solution?

val df = sqlc.read.json("empMgr.json");

empMgr.json

{"ID":101, "ename":"Peter", "sal":24.24, "dept":"11", "country":"US", "doj":"1/12/2017", "mgr":201}
{"ID":201, "ename":"John", "sal":1300, "dept":"232", "country":"IN", "doj":"4/22/2016", "mgr":111}
{"ID":301, "ename":"Sam", "dept":"22", "country":"KR", "doj":"5/22/2015", "mgr":201}

// 1. following is not working
var df_right=df; 
df.join(df_right, df("mgr") === df_right("ID")).show()
df.join(df, df("mgr") === df("ID")).show()

/*
 * Output:
 * +---+-------+----+---+-----+---+---+---+-------+----+---+-----+---+---+
    | ID|country|dept|doj|ename|mgr|sal| ID|country|dept|doj|ename|mgr|sal|
    +---+-------+----+---+-----+---+---+---+-------+----+---+-----+---+---+
    +---+-------+----+---+-----+---+---+---+-------+----+---+-----+---+---+
 * */


//2. following works fine
df_right= sqlc.read.json("file:///opt/data/empMgr.json");  
df.join(df_right, df("mgr") === df_right("ID")).show()

/*
 *Output:
 * +---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+
    | ID|country|dept|      doj|ename|mgr|  sal| ID|country|dept|      doj|ename|mgr|   sal|
    +---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+
    |101|     US|  11|1/12/2017|Peter|201|24.24|201|     IN| 232|4/22/2016| John|111|1300.0|
    |301|     KR|  22|5/22/2015|  Sam|201| null|201|     IN| 232|4/22/2016| John|111|1300.0|
    +---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+ 
 * */


//3. following works fine
df.registerTempTable("empMgr")
sqlc.sql("select b.ename, a.ename as mgr,b.mgr from empMgr a join empMgr b on a.ID=b.mgr").show();

/*
 * output
 * +-----+----+---+
  |ename| mgr|mgr|
  +-----+----+---+
  |Peter|John|201|
  |  Sam|John|201|
  +-----+----+---+
 * */

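1 Answer:

Answer 0 (score: 2)

Use the DataFrame's as() method to give each side of the join an alias, so that references to identically named columns are unambiguous.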

df.as("a").join(df.as("b"), $"a.mgr" === $"b.ID").show

+---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+
| ID|country|dept|      doj|ename|mgr|  sal| ID|country|dept|      doj|ename|mgr|   sal|
+---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+
|101|     US|  11|1/12/2017|Peter|201|24.24|201|     IN| 232|4/22/2016| John|111|1300.0|
|301|     KR|  22|5/22/2015|  Sam|201| null|201|     IN| 232|4/22/2016| John|111|1300.0|
+---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+
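
As far as I understand Spark 1.6's behavior, the first attempt returns no rows because df and df_right are the same DataFrame: df("mgr") and df_right("ID") both resolve to columns of the same underlying plan, so the condition ends up comparing mgr with ID within a single row, which never matches in this data. The following is a minimal, self-contained sketch of the alias approach, not a definitive implementation. The object name SelfJoinSketch, the helper employeesWithManagers, and the in-memory sampleRows are assumptions made for illustration (the sample rows mirror the values shown in the outputs above). It uses col("a.mgr") from org.apache.spark.sql.functions, which should be equivalent to $"a.mgr" once sqlc.implicits._ has been imported.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

object SelfJoinSketch {
  // Hypothetical in-memory copy of empMgr.json, one JSON object per line.
  val sampleRows: Seq[String] = Seq(
    """{"ID":101,"ename":"Peter","sal":24.24,"dept":"11","country":"US","doj":"1/12/2017","mgr":201}""",
    """{"ID":201,"ename":"John","sal":1300,"dept":"232","country":"IN","doj":"4/22/2016","mgr":111}""",
    """{"ID":301,"ename":"Sam","dept":"22","country":"KR","doj":"5/22/2015","mgr":201}"""
  )

  // Self-join: alias "a" is the employee side, alias "b" is the manager side.
  // Qualifying every column with its alias removes the ambiguity.
  def employeesWithManagers(df: DataFrame): DataFrame =
    df.as("a")
      .join(df.as("b"), col("a.mgr") === col("b.ID"))
      .select(
        col("a.ename").as("ename"),
        col("b.ename").as("manager"),
        col("a.mgr").as("mgr"))

  def main(args: Array[String]): Unit = {
    // Local single-threaded context, only for trying the sketch out.
    val conf = new SparkConf().setAppName("self-join-sketch").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val sqlc = new SQLContext(sc)
    try {
      val df = sqlc.read.json(sc.parallelize(sampleRows))
      employeesWithManagers(df).show()
    } finally {
      sc.stop()
    }
  }
}
```

Selecting through the aliases also avoids ending up with two columns that share a name in the result, as happens with the two mgr columns in the SQL version in the question.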