Question

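I have a Spark DataFrame with multiple columns. I want to find and remove the rows that have duplicated values in one column (the other columns can differ).

I tried dropDuplicates(col_name), but it only drops the extra copies and still keeps one record per value in the DataFrame. What I need is to remove every entry that originally had a duplicate.

I am using Spark 1.6 and Scala 2.10.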

Answer 1

I would use a window function. Suppose you want to remove the rows with a duplicated id:

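import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
// $"..." needs the implicits of your context, e.g. import sqlContext.implicits._

df
  // count how many rows share each id
  .withColumn("cnt", count("*").over(Window.partitionBy($"id")))
  // keep only the ids that occur exactly once, then drop the helper column
  .where($"cnt" === 1)
  .drop($"cnt")
  .show()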

Answer 2

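This can be done by grouping by the column (or columns) in which to look for duplicates, then aggregating and filtering the result.

Example DataFrame df: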

+---+---+
| id|num|
+---+---+
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
|  4|  5|
+---+---+

Group by the id column to remove its duplicates (the last two rows):

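import org.apache.spark.sql.functions.{count, first}

// first() is safe here: only groups with exactly one row survive the filter
val df2 = df.groupBy("id")
  .agg(first($"num").as("num"), count($"id").as("count"))
  .filter($"count" === 1)
  .select("id", "num")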

This gives you:

+---+---+
| id|num|
+---+---+
|  1|  1|
|  2|  2|
|  3|  3|
+---+---+
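
To see how this differs from dropDuplicates without starting Spark, the same group, count and filter logic can be sketched on an ordinary Scala collection. This is only an illustration: Row, keepOnePerId and dropAllDuplicatedIds are hypothetical names, not part of the Spark API.

```scala
// Plain Scala, no Spark involved. All names here are illustrative.
object DuplicateSemantics {

  final case class Row(id: Int, num: Int)

  // Same data as the example DataFrame above.
  val rows: Seq[Row] = Seq(Row(1, 1), Row(2, 2), Row(3, 3), Row(4, 4), Row(4, 5))

  // Roughly what dropDuplicates on id does: one surviving row per id.
  def keepOnePerId(rs: Seq[Row]): Seq[Row] =
    rs.groupBy(_.id).values.map(_.head).toSeq.sortBy(_.id)

  // What the question asks for: drop every row whose id occurs more than once.
  def dropAllDuplicatedIds(rs: Seq[Row]): Seq[Row] = {
    val counts = rs.groupBy(_.id).map { case (id, group) => id -> group.size }
    rs.filter(r => counts(r.id) == 1)
  }

  def main(args: Array[String]): Unit = {
    println(keepOnePerId(rows))         // still contains one row with id 4
    println(dropAllDuplicatedIds(rows)) // contains no row with id 4
  }
}
```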

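Alternatively, it can be done with a join. It will be slower, but if there are many columns you do not need a first($"num").as("num") for each of them to keep them.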

// ids that occur exactly once
val df2 = df.groupBy("id").agg(count($"id").as("count")).filter($"count" === 1).select("id")
// the inner join keeps only the rows of df whose id is in df2
val df3 = df.join(df2, Seq("id"), "inner")
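
The join approach extends naturally to several key columns. Below is a minimal sketch, assuming a Spark 1.6-style SQLContext as in the question; dropAllDuplicates and the object name are hypothetical, and the helper assumes that none of the key columns is itself named "count".

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

object DropAllDuplicates {

  // Hypothetical helper: keeps only the rows whose key combination occurs exactly once.
  // Assumes no key column is named "count", because groupBy(...).count() adds that column.
  def dropAllDuplicates(df: DataFrame, keys: Seq[String]): DataFrame = {
    val uniqueKeys = df
      .groupBy(keys.map(col): _*)
      .count()
      .filter(col("count") === 1)
      .select(keys.map(col): _*)
    // Joining on column names keeps a single copy of each key column.
    df.join(uniqueKeys, keys, "inner")
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("drop-all-duplicates").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Same data as the example DataFrame above.
    val df = Seq((1, 1), (2, 2), (3, 3), (4, 4), (4, 5)).toDF("id", "num")
    dropAllDuplicates(df, Seq("id")).show() // row order is not guaranteed
    sc.stop()
  }
}
```

Note that an inner join never matches null keys, so rows whose key is null would also be dropped by this sketch.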

Answer 3

I added a killDuplicates() method, which uses @Raphael Roth's solution, to the open source spark-daria library. Here is how to use it:

import com.github.mrpowers.spark.daria.sql.DataFrameExt._
import org.apache.spark.sql.functions.col

df.killDuplicates(col("id"))

// you can also supply multiple Column arguments
df.killDuplicates(col("id"), col("another_column"))

Here is the implementation:

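import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

object DataFrameExt {

  implicit class DataFrameMethods(df: DataFrame) {

    // Removes every row whose combination of values in cols occurs more than once.
    def killDuplicates(cols: Column*): DataFrame = {
      df
        .withColumn(
          "my_super_secret_count",
          count("*").over(Window.partitionBy(cols: _*))
        )
        .where(col("my_super_secret_count") === 1)
        .drop(col("my_super_secret_count"))
    }

  }

}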

You may want to use the spark-daria library to keep this logic out of your own codebase.

Remove all records that are duplicated in a Spark DataFrame
