Question

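I am trying to remove the duplicate columns that appear after a join, keeping the unique columns and only one copy of each duplicated column.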

For example, the DataFrame with duplicate columns:

root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- loc: string (nullable = true)
|-- sal: string (nullable = true)
|-- name: string (nullable = true)
|-- loc: string (nullable = true)
|-- sal: string (nullable = true)


After removing duplicates, the output should be

root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- loc: string (nullable = true)
|-- sal: string (nullable = true)

Any help would be appreciated.

Answer 1

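As Shaido already commented above, you should drop every column that is not used in the join before joining, because removing them after the join is difficult. For example, if loc and sal are not used in the join: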

df2.drop("loc", "sal")

or

df1.drop("loc", "sal")

If you join on column names (for example id and name), do:

df1.join(df2, Seq("id", "name"))
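To make the two steps above concrete, here is a minimal sketch that combines them. It assumes a local SparkSession and made-up sample rows; the object name JoinExample and the data are hypothetical, only drop and join(right, usingColumns) come from the answer.

```scala
import org.apache.spark.sql.SparkSession

object JoinExample {
  // Builds two small DataFrames and returns the column names of the joined result
  def joinedColumns(): Seq[String] = {
    // Local session for illustration only (assumption, not part of the answer)
    val spark = SparkSession.builder().master("local[1]").appName("join-example").getOrCreate()
    import spark.implicits._
    try {
      // Hypothetical sample data with the schema from the question
      val df1 = Seq(("1", "alice", "NY", "100")).toDF("id", "name", "loc", "sal")
      val df2 = Seq(("1", "alice", "LA", "200")).toDF("id", "name", "loc", "sal")

      // Drop the non-key columns from one side, then join on the shared key names
      val joined = df1.join(df2.drop("loc", "sal"), Seq("id", "name"))
      joined.columns.toSeq
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    println(joinedColumns().mkString(", "))
}
```

Because the join keys are passed as a Seq of names, each key appears once in the result, and loc and sal come only from df1, so this should print id, name, loc, sal.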

Answer 2

I believe that if you want a generic approach, the code below may help. With it you do not need to name the duplicated columns.

First create an implicit class (a better design approach):

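import org.apache.spark.sql.DataFrame
import scala.annotation.tailrec

// Note: in Scala 2 an implicit class must be defined inside an object, class, or trait.
implicit class DataFrameOperations(df: DataFrame) {
  // For every column name that occurs more than once in df, drops the copy that belongs to rmvDF.
  def dropDuplicateCols(rmvDF: DataFrame): DataFrame = {
    // Column names that occur more than once in the joined DataFrame
    val cols = df.columns.groupBy(identity).collect { case (name, occ) if occ.length > 1 => name }.toSeq

    // Drop rmvDF's copy of each duplicated column, one name at a time
    @tailrec def deleteCol(df: DataFrame, cols: Seq[String]): DataFrame =
      if (cols.isEmpty) df else deleteCol(df.drop(rmvDF(cols.head)), cols.tail)

    deleteCol(df, cols)
  }
}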
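The name-detection step of that helper can be tried without Spark. The sketch below is an illustration only: the object DuplicateNames and its method duplicated are hypothetical names, and the input is the column list from the question.

```scala
object DuplicateNames {
  // Returns the set of names that occur more than once in the given column list
  def duplicated(columns: Seq[String]): Set[String] =
    columns
      .groupBy(identity)                                  // name -> all occurrences of that name
      .collect { case (name, occ) if occ.size > 1 => name } // keep names seen at least twice
      .toSet

  def main(args: Array[String]): Unit = {
    // Column names of the joined DataFrame from the question
    val columns = Seq("id", "name", "loc", "sal", "name", "loc", "sal")
    // id occurs once, so only name, loc and sal are reported
    println(duplicated(columns) == Set("name", "loc", "sal"))
  }
}
```

A Set is returned here because groupBy does not guarantee any ordering of the names, and the order in which the duplicates are dropped does not matter.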

To call the method, use the following:

val dupDF = rdd1.join(rdd2,"id").dropDuplicateCols(rdd1)

Answer 3

// For example
// Note: these calls remove duplicate rows, not duplicate columns
val dataFrame = sparkSession.sql("SELECT .....")
dataFrame.distinct() // since 2.0.0
// or
dataFrame.dropDuplicates()
// or
dataFrame.dropDuplicates(colNames)

How to remove duplicate columns from a DataFrame while keeping the unique columns (including only one of each duplicated column)
