Eliminating duplicates in columns

Time: 2019-10-31 18:28:36

Tags: scala apache-spark

How can I eliminate the repeated values in Column_3 and Column_4?

+--------+--------+--------+--------+
|Column_1|Column_2|Column_3|Column_4|
+--------+--------+--------+--------+
|       1|       x|     abc|     www|
|       1|       x|     abc|     sdf|
|       1|       x|     abc|     xyz|
|       1|       x|     def|     www|
|       1|       x|     def|     sdf|
|       1|       x|     def|     xyz|
+--------+--------+--------+--------+

Expected output

+--------+--------+--------+--------+
|Column_1|Column_2|Column_3|Column_4|
+--------+--------+--------+--------+
|       1|       x|     abc|     www|
|       1|       x|     def|     sdf|
|       1|       x|    null|     xyz|
+--------+--------+--------+--------+

1 answer:

Answer 0 (score: 0)

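Use df.dropDuplicates("Column_3", "Column_4"), passing the column names as strings. This keeps one row for each distinct (Column_3, Column_4) pair.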
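A minimal sketch of that call on the sample data from the question, assuming a local SparkSession. The object name DropDuplicatesSketch and the helper sampleData are hypothetical, introduced only for illustration. In this sample every (Column_3, Column_4) pair is already unique, so deduplicating on both columns together leaves all six rows, while deduplicating on Column_3 alone leaves two.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of the original answer.
object DropDuplicatesSketch {

  // Builds the six-row sample table from the question.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, "x", "abc", "www"),
      (1, "x", "abc", "sdf"),
      (1, "x", "abc", "xyz"),
      (1, "x", "def", "www"),
      (1, "x", "def", "sdf"),
      (1, "x", "def", "xyz")
    ).toDF("Column_1", "Column_2", "Column_3", "Column_4")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode is enough for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("drop-duplicates-sketch")
      .getOrCreate()

    val df = sampleData(spark)

    // One row per distinct (Column_3, Column_4) pair.
    // All six pairs differ here, so nothing is dropped.
    df.dropDuplicates("Column_3", "Column_4").show()

    // One row per distinct Column_3 value.
    // Which of the matching rows survives is not guaranteed.
    df.dropDuplicates("Column_3").show()

    spark.stop()
  }
}
```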

Also, this is a possible duplicate of "Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame".
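The expected output in the question deduplicates Column_3 and Column_4 independently and pads the shorter column with null. One way this might be sketched, which is not taken from the answer above, is to number the distinct values of each column with row_number and then full-outer-join the two results on that number. The names IndependentDistinctSketch, distinctWithIndex, alignDistinct and idx are hypothetical. Since a DataFrame has no inherent row order, the sketch assumes alphabetical ordering inside each column, so the values are paired differently from the order shown in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Hypothetical helper object, not part of the original answer.
object IndependentDistinctSketch {

  // Distinct values of one column per (Column_1, Column_2) group,
  // numbered 1, 2, 3, ... in alphabetical order (an assumption).
  def distinctWithIndex(df: DataFrame, colName: String): DataFrame = {
    val w = Window.partitionBy("Column_1", "Column_2").orderBy(colName)
    df.select("Column_1", "Column_2", colName)
      .distinct()
      .withColumn("idx", row_number().over(w))
  }

  // Aligns the distinct values of Column_3 and Column_4 side by side.
  // The full outer join fills the shorter column with null.
  def alignDistinct(df: DataFrame): DataFrame = {
    val c3 = distinctWithIndex(df, "Column_3")
    val c4 = distinctWithIndex(df, "Column_4")
    c3.join(c4, Seq("Column_1", "Column_2", "idx"), "full_outer")
      .orderBy("Column_1", "Column_2", "idx")
      .drop("idx")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode is enough for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("independent-distinct-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "x", "abc", "www"),
      (1, "x", "abc", "sdf"),
      (1, "x", "abc", "xyz"),
      (1, "x", "def", "www"),
      (1, "x", "def", "sdf"),
      (1, "x", "def", "xyz")
    ).toDF("Column_1", "Column_2", "Column_3", "Column_4")

    alignDistinct(df).show()

    spark.stop()
  }
}
```

If the original row order matters, the input would need an explicit ordering column to sort on in place of the alphabetical orderBy.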