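I want to compute the duplicate count and the unique count for every column of a DataFrame, where the list of columns is supplied dynamically.
My code: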
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

// Column names arrive as a pipe-delimited string
val s = "id|name|nct_id|etc"
val dup_cols = s.split('|').toList
val conditionDF = patternsqlDF.select(patternsqlDF.col("id"), patternsqlDF.col("name"))
// For each selected column, show the rows whose value occurs more than once
conditionDF.columns.foreach { c =>
  val colDF = patternsqlDF.select(patternsqlDF.col(c))
  colDF.withColumn("count", count("*").over(Window.partitionBy(c))).where(col("count") > 1).show()
}
Sample input DF (inputDF.show()):
+----+------+------------+------------------+-----------+
| Id | name | Phone      | Email            | city      |
+----+------+------------+------------------+-----------+
| 2  | Ram  | 9876543210 | ram@gmail.com    | Pune      |
| 3  | Sam  | 8765432104 | sam@gmail.com    | Bangalore |
| 4  | Sugu | 7655555555 | sam@gmail.com    | Hyderabad |
| 3  | Sam  | 8765042222 | sam@gmail.com    | Chennai   |
| 5  | Sugu | 9876543210 | sugu95@gmail.com | Mysore    |
+----+------+------------+------------------+-----------+
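The output DataFrames should look like the following.
1. Duplicate count output DF
Here I want the duplicate count for each column, that is, the number of rows whose value in that column appears more than once. (Note: the columns are a dynamic input. Their number may grow or shrink, and the DataFrame may contain any number of columns.)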
+------+-------+-------+------+
| name | Phone | Email | city |
+------+-------+-------+------+
| 4    | 2     | 3     | 0    |
+------+-------+-------+------+
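2. Unique count output DF
Here I want the number of unique records for each column, that is, the number of rows whose value in that column appears exactly once. (Note: the columns are a dynamic input. Their number may grow or shrink.)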
+------+-------+------+
| name | Email | city |
+------+-------+------+
| 1    | 2     | 5    |
+------+-------+------+
Can someone tell me how to compute the duplicate and unique value counts for each column of a DataFrame?
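
One possible approach, offered as a sketch rather than a definitive answer: first replace each requested column by the number of rows that share that row's value (a window count partitioned by the column), then sum up the rows where that number is greater than 1 (duplicates) or exactly 1 (uniques). The object name ColumnCounts, its helper methods, and the local SparkSession setup are hypothetical names introduced for illustration. The sketch also assumes that column names contain no dots or other characters needing escaping, and that null values are treated as one ordinary group.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, sum, when}

object ColumnCounts {
  // Replace every listed column by the number of rows sharing that row's value.
  private def occurrences(df: DataFrame, cols: Seq[String]): DataFrame =
    df.select(cols.map(c => count(lit(1)).over(Window.partitionBy(col(c))).as(c)): _*)

  // One-row DataFrame: per column, the rows whose value appears more than once.
  def duplicateCounts(df: DataFrame, cols: Seq[String]): DataFrame =
    occurrences(df, cols)
      .select(cols.map(c => sum(when(col(c) > 1, 1).otherwise(0)).as(c)): _*)

  // One-row DataFrame: per column, the rows whose value appears exactly once.
  def uniqueCounts(df: DataFrame, cols: Seq[String]): DataFrame =
    occurrences(df, cols)
      .select(cols.map(c => sum(when(col(c) === 1, 1).otherwise(0)).as(c)): _*)

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only (an assumption, not part of the question).
    val spark = SparkSession.builder().master("local[*]").appName("column-counts").getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val inputDF = Seq(
      (2, "Ram", "9876543210", "ram@gmail.com", "Pune"),
      (3, "Sam", "8765432104", "sam@gmail.com", "Bangalore"),
      (4, "Sugu", "7655555555", "sam@gmail.com", "Hyderabad"),
      (3, "Sam", "8765042222", "sam@gmail.com", "Chennai"),
      (5, "Sugu", "9876543210", "sugu95@gmail.com", "Mysore")
    ).toDF("Id", "name", "Phone", "Email", "city")

    // Dynamic column list, parsed from a pipe-delimited string as in the question.
    val cols = "name|Phone|Email|city".split('|').toSeq

    duplicateCounts(inputDF, cols).show()
    uniqueCounts(inputDF, cols).show()

    spark.stop()
  }
}
```

On the sample input this should give the values shown in the two expected tables for the columns listed there. Note that each requested column adds its own window partitioning and therefore its own shuffle, so for a wide DataFrame a per-column groupBy followed by an aggregation may be cheaper. That trade-off has not been measured here.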