How do I count the duplicate and unique values in each column of a Spark Scala DataFrame?

Time: 2018-11-23 13:17:49

Tags: scala apache-spark apache-spark-sql spark-streaming apache-spark-dataset

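I want to compute a duplicate count and a unique count for every column of a DataFrame (the column names are supplied as dynamic input).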

My code:

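  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.count
  import spark.implicits._ // for $"..."; assumes a SparkSession named spark is in scope

  // Column names arrive dynamically as a pipe-delimited string
  val s = "id|name|nct_id|etc"
  val dup_cols = s.split('|').toList
  val conditionDF = patternsqlDF.select(patternsqlDF.col("id"), patternsqlDF.col("name"))

  // For each selected column, show the rows whose value occurs more than once
  conditionDF.columns.foreach { colName =>
    val colDF = patternsqlDF.select(patternsqlDF.col(colName))
    colDF.withColumn("count", count("*").over(Window.partitionBy(colName))).where($"count" > 1).show()
  }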

Sample input DF (inputDF.show()):

    -------------------------------------------------------------------------------------
   | Id            | name       | Phone           | Email               |  city         |
   + --------------+------------+-----------------+---------------------+---------------+
   |   2           | Ram        | 9876543210      | ram@gmail.com       | Pune          |
   |   3           | Sam        | 8765432104      | sam@gmail.com       | Bangalore     |
   |   4           | Sugu       | 7655555555      | sam@gmail.com       | Hyderabad     |
   |   3           | Sam        | 8765042222      | sam@gmail.com       | Chennai       |
   |   5           | Sugu       | 9876543210      | sugu95@gmail.com    | Mysore        |
   --------------------------------------------------------------------------------------

The output DataFrames should look like this:

  1. Duplicate count output DF

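Here I want a duplicate count for each column. (Note: the columns are dynamic input; their number may grow or shrink, and the DataFrame can contain any number of columns.)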

              ----------------------------------------------------------------
              |  name       | Phone           | Email       | city           | 
              +-------------+-----------------+------------------------------+
              |  4          |   2             |  3          | 0              |
               ---------------------------------------------------------------
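One way to produce a table of this shape is sketched below. It reads "duplicate count" as the number of rows whose value in a column occurs more than once, which is an interpretation inferred from the sample numbers rather than something stated in the question. The object name `DuplicateCounts`, the helper `duplicateCounts`, and the temporary `<column>__n` columns are hypothetical names introduced only for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, sum, when}

object DuplicateCounts {
  // Hypothetical helper: returns one row with one column per requested name.
  // Each cell is the number of rows whose value in that column occurs more than once.
  def duplicateCounts(df: DataFrame, cols: Seq[String]): DataFrame = {
    require(cols.nonEmpty, "at least one column name is required")
    // For every requested column, attach the size of the group the row belongs to.
    val withGroupSizes = cols.foldLeft(df) { (acc, c) =>
      acc.withColumn(s"${c}__n", count(lit(1)).over(Window.partitionBy(col(c))))
    }
    // Count the rows whose group size exceeds 1.
    val aggs = cols.map(c => sum(when(col(s"${c}__n") > 1, 1).otherwise(0)).as(c))
    withGroupSizes.agg(aggs.head, aggs.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for the demo only; a real job would reuse its existing session.
    val spark = SparkSession.builder().master("local[*]").appName("duplicate-counts").getOrCreate()
    import spark.implicits._

    // The sample input from the question.
    val inputDF = Seq(
      (2, "Ram", "9876543210", "ram@gmail.com", "Pune"),
      (3, "Sam", "8765432104", "sam@gmail.com", "Bangalore"),
      (4, "Sugu", "7655555555", "sam@gmail.com", "Hyderabad"),
      (3, "Sam", "8765042222", "sam@gmail.com", "Chennai"),
      (5, "Sugu", "9876543210", "sugu95@gmail.com", "Mysore")
    ).toDF("Id", "name", "Phone", "Email", "city")

    // Column names supplied dynamically as a pipe-delimited string.
    val dynamicCols = "name|Phone|Email|city".split('|').toList
    duplicateCounts(inputDF, dynamicCols).show()

    spark.stop()
  }
}
```

On the sample input this gives name = 4, Phone = 2, Email = 3 and city = 0, the same numbers as the expected table. Each requested column adds its own window partition, so for wide DataFrames a per-column `groupBy` may be cheaper.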

  2. Unique count output DF

I want the number of unique records in each column. (Note: the columns are dynamic input; their number may grow or shrink.)

              ----------------------------------------------
              |  name       |  Email       | city           |  
              +-------------+--------------+----------------+
              |  1          |   2          | 5              |
               ----------------------------------------------          
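A minimal sketch of the unique count follows, assuming "unique" means a value that occurs exactly once in its column. That reading is inferred from the sample numbers (for example, only Ram occurs once in name). The object name `UniqueCounts`, the helper `uniqueCounts`, and the temporary `__n` column are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object UniqueCounts {
  // Hypothetical helper: returns one row with one column per requested name.
  // Each cell is the number of values that occur exactly once in that column.
  def uniqueCounts(df: DataFrame, cols: Seq[String]): DataFrame = {
    // One small aggregation job per column: group by value, keep groups of size 1.
    val counts: Seq[Long] = cols.map { c =>
      df.groupBy(col(c))
        .agg(count(lit(1)).as("__n"))
        .filter(col("__n") === 1)
        .count()
    }
    // Assemble the per-column results into a single-row DataFrame.
    val schema = StructType(cols.map(c => StructField(c, LongType, nullable = false)))
    val spark = df.sparkSession
    spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row.fromSeq(counts))), schema)
  }

  def main(args: Array[String]): Unit = {
    // Local session for the demo only; a real job would reuse its existing session.
    val spark = SparkSession.builder().master("local[*]").appName("unique-counts").getOrCreate()
    import spark.implicits._

    // The sample input from the question.
    val inputDF = Seq(
      (2, "Ram", "9876543210", "ram@gmail.com", "Pune"),
      (3, "Sam", "8765432104", "sam@gmail.com", "Bangalore"),
      (4, "Sugu", "7655555555", "sam@gmail.com", "Hyderabad"),
      (3, "Sam", "8765042222", "sam@gmail.com", "Chennai"),
      (5, "Sugu", "9876543210", "sugu95@gmail.com", "Mysore")
    ).toDF("Id", "name", "Phone", "Email", "city")

    // Column names supplied dynamically as a pipe-delimited string.
    val dynamicCols = "name|Email|city".split('|').toList
    uniqueCounts(inputDF, dynamicCols).show()

    spark.stop()
  }
}
```

On the sample input this gives name = 1, Email = 2 and city = 5, the same numbers as the expected table. Note that this is not what `countDistinct` computes: `countDistinct` would report 3 for name (Ram, Sam, Sugu), because it counts distinct values rather than values that appear only once.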

Can anyone tell me how to count the duplicate and unique values in each column of a DataFrame?

0 answers:

No answers yet