I have two sample dataframes, df_a and df_b.
df_a
+----+------+-----------+-----------+
| id | name | mobile1   | address   |
+----+------+-----------+-----------+
| 1  | Matt | 123456798 |           |
+----+------+-----------+-----------+
| 2  | John | 123456798 |           |
+----+------+-----------+-----------+
| 3  | Lena |           |           |
+----+------+-----------+-----------+
df_b
+----+------+-----------+---------+---------+
| id | name | mobile2   | city    | country |
+----+------+-----------+---------+---------+
| 3  | Lena | 123456798 | Detroit | USA     |
+----+------+-----------+---------+---------+
I am trying to select certain columns from the two of them, like this:
df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer").select(
  df_a("name"), df_a("id"), df_a("address"),
  coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)
I want to do the same kind of thing with two real dataframes, where df_a has a large number of columns. I want to select all the columns of df_a, in a specific order, plus two columns from df_b. So I tried the following:
val df_a_cols : String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b, df_a("id") <=> df_b("id"), "left_outer")
.select(
df_a_cols,
coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)
and
val df_a_cols : String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b, df_a("id") <=> df_b("id"), "left_outer")
.selectExpr(
df_a_cols,
coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)
But I am obviously not passing the right kind of arguments to select and selectExpr. Could someone please help me out? I am using Spark 1.5.0.
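To make the intent concrete, this is the shape of what I think I need: split the comma-separated name list into individual Column references and pass them, together with the coalesce column, to the Column* overload of select. The split/map and the :+ / : _* plumbing below are only my sketch of that idea; I have not been able to verify it on 1.5.0:

import org.apache.spark.sql.functions.{coalesce, col, lit}

val df_a_cols : String = "DFA.name,DFA.id,DFA.address"
// my guess: turn each comma-separated name into a Column reference
val dfaCols = df_a_cols.split(",").map(name => col(name.trim))

df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .select(dfaCols :+ coalesce(df_a("mobile1"), df_b("mobile2"), lit(0)) : _*)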
Update
I tried the following:
val df_a_cols : String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
.select(
df_a_cols+",coalesce(DFA.mobile1, DFB.mobile2, 0)"
)
and got the error:
org.apache.spark.sql.AnalysisException: cannot resolve 'DFA.name,DFA.id,DFA.address,coalesce(DFA.mobile1, DFB.mobile2, 0) ' given input columns id, name, mobile1, address, mobile2, city, country;
Then I tried
val df_a_cols : String = "name,id,address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
.select(
df_a_cols+",coalesce(mobile1, mobile2, 0)"
)
and got
org.apache.spark.sql.AnalysisException: cannot resolve ' name,id,address,coalesce(mobile1, mobile2, 0) ' given input columns id, name, mobile1, address, mobile2, city, country;
Using
val df_a_cols : String = "name,id,address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
.selectExpr(
df_a_cols+",coalesce(mobile1, mobile2, 0)"
)
I got
java.lang.RuntimeException: [1.10] failure: identifier expected
name,id,address,coalesce(mobile1, mobile2, 0)
^
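What I suspect selectExpr actually wants is one string per expression, rather than a single comma-joined string (and since id and name exist in both frames, the DFA/DFB prefixes probably need to come back). Something like the following is what I am aiming for, but it is only a sketch and I have not verified it on my 1.5.0 setup:

val df_a_cols : String = "DFA.name,DFA.id,DFA.address"
// split the list so each name becomes its own expression string,
// then append the coalesce expression as one more string
val exprs = df_a_cols.split(",").map(_.trim) :+ "coalesce(DFA.mobile1, DFB.mobile2, 0)"

df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .selectExpr(exprs : _*)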