Spark / Scala: selecting columns from multiple DataFrames

Time: 2017-03-13 14:08:55

Tags: scala apache-spark dataframe

I have two sample DataFrames, df_a and df_b:

df_a
+----+------+-----------+-----------+
| id | name | mobile1   | address   |
+----+------+-----------+-----------+
| 1  | Matt | 123456798 |           |
+----+------+-----------+-----------+
| 2  | John | 123456798 |           |
+----+------+-----------+-----------+
| 3  | Lena |           |           |
+----+------+-----------+-----------+

df_b 
+----+------+-----------+-------+---------+
| id | name | mobile2   | city  | country |
+----+------+-----------+-------+---------+
| 3  | Lena | 123456798 |Detroit|  USA    |
+----+------+-----------+-------+---------+

I am trying to select certain columns from both of them, as follows:

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer").select(
 df_a("name"), df_a("id"), df_a("address"),
 coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)

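I want to do a similar operation on two real DataFrames, where df_a has a large number of columns. I want to select all the columns of df_a plus two columns from df_b, in a specific order. So I tried the following: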

val df_a_cols : String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b, df_a("id") <=> df_b("id"), "left_outer")
.select(
 df_a_cols,
 coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)

val df_a_cols : String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b, df_a("id") <=> df_b("id"), "left_outer")
.selectExpr(
 df_a_cols,
 coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)

But apparently I am passing the wrong argument types to select and selectExpr.

Can someone help me? I am using Spark 1.5.0.
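
On the argument types: in the Spark 1.5 DataFrame API, `select` accepts either `Column` values only or `String` column names only, and `selectExpr` accepts SQL expression strings only, so a `String` cannot be passed alongside a `Column` in one call. Below is a minimal sketch of one way to stay entirely in `Column` values, assuming the column names from the sample tables above. The object name `SelectColumnsSketch`, the helper `selectWithFallback`, and the output alias `mobile` are hypothetical and do not come from the original post.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{coalesce, lit}

object SelectColumnsSketch {
  // Hypothetical helper: joins the two frames and selects df_a's columns
  // in a fixed order, followed by the coalesced mobile number.
  def selectWithFallback(dfA: DataFrame, dfB: DataFrame): DataFrame = {
    // df_a columns in the desired output order, resolved against dfA so
    // that "name" and "id" are not ambiguous after the join.
    val dfACols: Seq[Column] =
      Seq("name", "id", "address").map(colName => dfA(colName))

    // First non-null of mobile1, mobile2, then the literal 0.
    // The alias "mobile" is an assumption made for readability.
    val mobile: Column =
      coalesce(dfA("mobile1"), dfB("mobile2"), lit(0)).as("mobile")

    // Everything passed to select is a Column, expanded as varargs.
    dfA.join(dfB, dfA("id") <=> dfB("id"), "left_outer")
      .select((dfACols :+ mobile): _*)
  }
}
```

With a large number of columns, the name list could presumably be derived from `df_a.columns` (for example `df_a.columns.filter(_ != "mobile1")`) rather than typed by hand, at the cost of taking whatever order the schema has.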

Update

I tried the following:

val df_a_cols : String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
.select(
 df_a_cols+",coalesce(DFA.mobile1, DFB.mobile2, 0)"
)

and received the error:

org.apache.spark.sql.AnalysisException: cannot resolve 'DFA.name,DFA.id,DFA.address,coalesce(DFA.mobile1, DFB.mobile2, 0) ' given input columns id, name, mobile1, address, mobile2, city, country;

Then I tried:

val df_a_cols : String = "name,id,address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
.select(
 df_a_cols+",coalesce(mobile1, mobile2, 0)"
)

and got:

org.apache.spark.sql.AnalysisException: cannot resolve ' name,id,address,coalesce(mobile1, mobile2, 0) ' given input columns id, name, mobile1, address, mobile2, city, country;

Using

val df_a_cols : String = "name,id,address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
.selectExpr(
 df_a_cols+",coalesce(mobile1, mobile2, 0)"
)

I got:

java.lang.RuntimeException: [1.10] failure: identifier expected

name,id,address,coalesce(mobile1, mobile2, 0)
    ^
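
The errors in this update look consistent with the whole comma-separated string being treated as a single column name (by `select`) or as a single expression (by `selectExpr`), since both methods take one string per column. Below is a sketch that passes each expression as its own string through varargs. It assumes the `DFA` and `DFB` aliases used above; the object name `SelectExprSketch`, the helper `selectWithFallback`, and the `AS mobile` alias are hypothetical, and the code has not been verified against Spark 1.5.0 specifically.

```scala
import org.apache.spark.sql.DataFrame

object SelectExprSketch {
  // One string per output column, in the desired order.
  val dfACols: Seq[String] = Seq("DFA.name", "DFA.id", "DFA.address")

  // The coalesce is written as a SQL expression string, so every
  // argument handed to selectExpr has the same type (String).
  val exprs: Seq[String] =
    dfACols :+ "coalesce(DFA.mobile1, DFB.mobile2, 0) AS mobile"

  // Hypothetical helper: aliases both frames, joins them, and expands
  // the expression list as varargs instead of one joined string.
  def selectWithFallback(dfA: DataFrame, dfB: DataFrame): DataFrame =
    dfA.as("DFA")
      .join(dfB.as("DFB"), dfA("id") <=> dfB("id"), "left_outer")
      .selectExpr(exprs: _*)
}
```

If the column list has to start out as a single string such as "name,id,address", splitting it first with `split(",")` would give the separate pieces that `selectExpr` expects.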

0 answers:

No answers