I think this question is similar to some other questions, but it has not actually been asked.
In Spark, how can you run a SQL query so that duplicate columns are removed?
For example, a SQL query run on Spark:
select a.*, b.*
from a
left outer join b
  on a.id = b.id
How do I remove the duplicate column b.id in this case?
I know we can do it with extra steps in Spark, such as giving an alias or renaming the column, but is there a faster way to remove the duplicate column just by writing the SQL query?
Answer 0 (score: 1)
I have two dataframes, df1 and df2, and I perform a join operation based on the id column.
scala> val df1 = Seq((1,"mahesh"), (2,"shivangi"),(3,"manoj")).toDF("id", "name")
df1: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> df1.show
+---+--------+
| id|    name|
+---+--------+
|  1|  mahesh|
|  2|shivangi|
|  3|   manoj|
+---+--------+
scala> val df2 = Seq((1,24), (2,23),(3,24)).toDF("id", "age")
df2: org.apache.spark.sql.DataFrame = [id: int, age: int]
scala> df2.show
+---+---+
| id|age|
+---+---+
|  1| 24|
|  2| 23|
|  3| 24|
+---+---+
This is the incorrect solution, where the join column is defined as a predicate:
df1("id") === df2("id")
The undesired result is that the id column is duplicated in the joined dataframe:
scala> df1.join(df2, df1("id") === df2("id"), "left").show
+---+--------+---+---+
| id|    name| id|age|
+---+--------+---+---+
|  1|  mahesh|  1| 24|
|  2|shivangi|  2| 23|
|  3|   manoj|  3| 24|
+---+--------+---+---+
The correct solution is to define the join column as the string sequence Seq("id") rather than as an expression. The joined dataframe then has no duplicate columns.
scala> df1.join(df2, Seq("id"),"left").show
+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|  mahesh| 24|
|  2|shivangi| 23|
|  3|   manoj| 24|
+---+--------+---+
See here for more details.
Answer 1 (score: 0)
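Tying this back to the SQL form in the question: a minimal sketch, assuming Spark 2.x and that df1 and df2 are registered as temporary views named a and b (hypothetical names). Spark SQL's USING clause behaves like the Seq("id") variant and keeps a single id column.
// Sketch, not from the original answer: register the dataframes as temp views
// so the join can be written in plain SQL (assumes Spark 2.x).
df1.createOrReplaceTempView("a")
df2.createOrReplaceTempView("b")

// USING merges the join column, so only one id column appears in the output.
spark.sql("SELECT * FROM a LEFT JOIN b USING (id)").show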
Since Spark 1.4.0 you can use join in two ways: with usingColumns or with joinExprs. When the first form is used, the join columns appear only once in the output (see the sketch after the Scaladoc below).
/**
* Inner equi-join with another [[DataFrame]] using the given columns.
*
* Different from other join functions, the join columns will only appear once in the output,
* i.e. similar to SQL's `JOIN USING` syntax.
*
* {{{
* // Joining df1 and df2 using the columns "user_id" and "user_name"
* df1.join(df2, Seq("user_id", "user_name"))
* }}}
*
* Note that if you perform a self-join using this function without aliasing the input
* [[DataFrame]]s, you will NOT be able to reference any columns after the join, since
* there is no way to disambiguate which side of the join you would like to reference.
*
* @param right Right side of the join operation.
* @param usingColumns Names of the columns to join on. These columns must exist on both sides.
* @group dfops
* @since 1.4.0
*/
def join(right: DataFrame, usingColumns: Seq[String]): DataFrame = {
join(right, usingColumns, "inner")
}
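As a rough sketch (not part of the quoted Scaladoc), the two overloads compare as follows, reusing the df1/df2 from the first answer; the drop(Column) call assumes Spark 2.0+.
// usingColumns overload: inner join by default, "id" appears only once
val joinedOnce = df1.join(df2, Seq("id"))

// joinExprs overload: both id columns are kept ...
val joinedTwice = df1.join(df2, df1("id") === df2("id"))

// ... but, assuming Spark 2.0+, the right-hand copy can be dropped afterwards
val deduped = joinedTwice.drop(df2("id"))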