Need all duplicate records based on 4 specified columns with the Spark Dataset API org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]

Asked: 2019-02-27 19:05:59

Tags: scala apache-spark apache-spark-dataset

Based on these 4 columns (db_name, tb_name, column_name, latest_partition), I need all the duplicate records.

     scala> val df = spark.table("dbname.tbname").distinct()
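
For context, a tiny hypothetical example (the values are made up, and Date_Processed is omitted for brevity): two rows can differ in row_cnt or comments and therefore both survive distinct(), yet still be duplicates with respect to the 4 key columns.

    scala> // Hypothetical toy data: both rows survive distinct() because row_cnt and comments differ,
    scala> // yet they share the same (column_name, latest_partition, db_name, tb_name) key
    scala> val toy = Seq(("c1", "2019-02-01", "100", "x", "db1", "tb1"), ("c1", "2019-02-01", "200", "y", "db1", "tb1")).toDF("column_name", "latest_partition", "row_cnt", "comments", "db_name", "tb_name")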

Schema of df:

     scala> df.printSchema
                root
                 |-- column_name: string (nullable = true)
                 |-- latest_partition: string (nullable = true)
                 |-- row_cnt: string (nullable = true)
                 |-- comments: string (nullable = true)
                 |-- db_name: string (nullable = true)
                 |-- tb_name: string (nullable = true)
                 |-- Date_Processed: date (nullable = true)  

Getting the unique values from df:

    scala> val uniqueData = df.dropDuplicates("column_name","latest_partition","db_name","tb_name") 
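
For reference, dropDuplicates keeps one arbitrary row per key combination, so the rows it discards are exactly the surplus duplicate copies. A minimal sketch of recovering those discarded rows, assuming Spark 2.4 or later where Dataset.exceptAll is available:

    scala> // Multiset difference: the rows of df that dropDuplicates would discard (the extra copies per key)
    scala> val droppedCopies = df.exceptAll(uniqueData)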

I am trying to get all the records that are duplicated on these 4 specified columns:

    scala> import org.apache.spark.sql.expressions.Window
    scala> import org.apache.spark.sql.functions.{col, count}
    scala> val Dup = df.withColumn("cnt", count("*").over(Window.partitionBy("column_name","latest_partition","db_name","tb_name"))).where(col("cnt") > 1)

    scala> Dup.count
    res8: Long = 0
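
As an illustrative alternative (a sketch, not code from the question), the duplicated key combinations can also be surfaced with a plain groupBy plus a count filter, and joining back to df then returns every row that belongs to such a key:

    scala> // Hypothetical sketch: key combinations that occur more than once across the 4 columns
    scala> val dupKeys = df.groupBy("column_name", "latest_partition", "db_name", "tb_name").agg(count("*").as("cnt")).where(col("cnt") > 1).drop("cnt")
    scala> // All rows of df sharing one of those duplicated keys (the kept row included)
    scala> val dupRows = df.join(dupKeys, Seq("column_name", "latest_partition", "db_name", "tb_name"))

If this also returns zero rows, the 4-column combinations in df are already unique, which would be consistent with the window count above.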

Could you help me with how to get all the duplicate rows from the Dataset df?

0 Answers:

No answers yet.