I need all the duplicate values based on these 4 columns: (db_name, tb_name, column_name, latest_partition).
scala> val df = spark.table("dbname.tbname").distinct()
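Note that distinct() only removes rows that are identical across all columns, so duplicates on just the 4 key columns can still survive when any other column (e.g. row_cnt) differs. A minimal sketch with hypothetical toy data to illustrate the difference:

import spark.implicits._

// Two rows sharing the same 4-column key but with different row_cnt values.
val toy = Seq(
  ("db1", "t1", "c1", "p1", "10"),
  ("db1", "t1", "c1", "p1", "20")
).toDF("db_name", "tb_name", "column_name", "latest_partition", "row_cnt")

toy.distinct().count()  // 2: full rows differ, so distinct() keeps both
toy.dropDuplicates("db_name", "tb_name", "column_name", "latest_partition").count()  // 1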
The schema of df:
scala> df.printSchema
root
|-- column_name: string (nullable = true)
|-- latest_partition: string (nullable = true)
|-- row_cnt: string (nullable = true)
|-- comments: string (nullable = true)
|-- db_name: string (nullable = true)
|-- tb_name: string (nullable = true)
|-- Date_Processed: date (nullable = true)
Getting the unique values from df:
scala> val uniqueData = df.dropDuplicates("column_name","latest_partition","db_name","tb_name")
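If the goal is the copies that dropDuplicates throws away, one option is to subtract the deduplicated set from the original (a sketch, assuming Spark 2.4+ for exceptAll; not verified against this data):

// Everything in df that is not one of the kept representatives,
// i.e. the extra duplicate copies (multiplicity preserved).
val extraCopies = df.exceptAll(uniqueData)
extraCopies.show(false)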
I am trying to get all the duplicate records across these 4 specified columns using a window count:
scala> import org.apache.spark.sql.functions._
scala> import org.apache.spark.sql.expressions.Window
scala> val Dup = df.withColumn("cnt", count("*").over(Window.partitionBy("column_name", "latest_partition", "db_name", "tb_name"))).where(col("cnt") > 1)
scala> Dup.count
res8: Long = 0
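For comparison, a groupBy-based variant that I would expect to behave the same way (a sketch under the same schema; if it also returns 0, there simply are no rows sharing all 4 key values after the distinct() above):

// Keys occurring more than once across the 4 columns.
val keyCols = Seq("column_name", "latest_partition", "db_name", "tb_name")
val dupKeys = df.groupBy(keyCols.map(col): _*)
  .count()
  .where(col("count") > 1)

// Join back to df to retrieve the full duplicate rows.
val dupRows = df.join(dupKeys.select(keyCols.map(col): _*), keyCols)
dupRows.show(false)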
Could you help me with how to get all the duplicate values from the dataset df?