Question

Suppose I have a DataFrame with two columns, C1 and C2, as follows:

+---+---+
|C1 |C2 |
+---+---+
|a  |b  |
|c  |d  |
|b  |a  |
+---+---+

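I want to remove the logical duplicates from the table above, i.e. the row (b, a), which holds the same pair of values as (a, b) in swapped order.

I tried using a self-join but could not get any further.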

Answer 1

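You can create a new array column holding the values of C1 and C2, sort it, and then use dropDuplicates on that column to remove the duplicates (comments added for clarity):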

import org.apache.spark.sql.functions._
df
  .withColumn("sortedCol", sort_array(array("C1", "C2")))  // new sorted array column built from the values of C1 and C2
  .dropDuplicates("sortedCol")  // drop rows that are logically the same, e.g. (a, b) and (b, a)
  .drop("sortedCol")    // remove the helper column
  .show(false)

I hope this answer helps.
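
As a variant of the same idea, here is a minimal sketch for the two-column case that builds the canonical key with least and greatest instead of a sorted array. The local SparkSession, the sample data, and the helper column names lo and hi are assumptions made for illustration; they are not part of the original answer. Note that dropDuplicates does not guarantee which row of a duplicate pair is kept, so either (a, b) or (b, a) may survive.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, greatest, least}

// Assumed local session and sample data, for illustration only
val spark = SparkSession.builder().master("local[*]").appName("dedup-sketch").getOrCreate()
import spark.implicits._

val df = List(("a", "b"), ("c", "d"), ("b", "a")).toDF("C1", "C2")

df
  .withColumn("lo", least(col("C1"), col("C2")))     // smaller of the two values (hypothetical helper column)
  .withColumn("hi", greatest(col("C1"), col("C2")))  // larger of the two values (hypothetical helper column)
  .dropDuplicates("lo", "hi")                        // (a, b) and (b, a) share the same (lo, hi) key
  .drop("lo", "hi")                                  // remove the helper columns
  .show(false)
```

The sorted-array version generalizes more easily to more than two columns; this form just avoids creating an array for a pair.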

Answer 2

Using except:

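import spark.implicits._  // needed for toDF and $"..."; assumes a SparkSession named spark, as in spark-shell

val df = List(
  ("a", "b"),
  ("c", "d"),
  ("b", "a")).toDF("C1", "C2")

// remove every row whose C1 sorts after its C2
df.except(df.where($"C1" > $"C2")).show(false)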

Output:

+---+---+
|C1 |C2 |
+---+---+
|a  |b  |
|c  |d  |
+---+---+
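
One caveat with the except version: it removes every row where C1 > C2, even when the mirrored row is not present in the data, and except also collapses exact duplicate rows. If that matters, the self-join mentioned in the question can be completed roughly as below. This is a sketch under assumptions: the local SparkSession, the extra sample row (f, e), and the aliases l and r are illustrative and not from the original post.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session, for illustration only
val spark = SparkSession.builder().master("local[*]").appName("anti-join-sketch").getOrCreate()
import spark.implicits._

// Sample data plus one extra row (f, e) whose mirror (e, f) is absent
val df = List(("a", "b"), ("c", "d"), ("b", "a"), ("f", "e")).toDF("C1", "C2")

val l = df.as("l")
val r = df.as("r")

// A row is dropped only if its mirrored row exists AND it is the
// "larger value first" member of the pair
val mirrored = $"l.C1" === $"r.C2" && $"l.C2" === $"r.C1" && $"l.C1" > $"l.C2"

// left_anti keeps the rows of l that have no match in r under the condition
l.join(r, mirrored, "left_anti").show(false)
```

With this data the rows (a, b), (c, d) and (f, e) are kept and only (b, a) is removed, whereas the except version would also drop (f, e).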

How to remove logical duplicates from a DataFrame?
