HiveQL: remove duplicates, including the records that had duplicates

Asked: 2018-10-29 20:01:52

Tags: scala apache-spark hadoop hive

I have the result of a select statement stored in a DataFrame...

val df = spark.sqlContext.sql("select prty_tax_govt_issu_id from CST_EQUIFAX.eqfx_prty_emp_incm_info where emp_mtch_cd = 'Y' and emp_mtch_actv_rcrd_in = 'Y' and emp_sts_in = 'A'");

I then want to take this DataFrame and keep only the unique records. That is, find all duplicates on the prty_tax_govt_issu_id field and, where duplicates exist, remove not just the extra copies but every record with that prty_tax_govt_issu_id.

So the original DataFrame might look like this...

+---------------------+
|prty_tax_govt_issu_id|
+---------------------+
|            000000005|
|            000000012|
|            000000012|
|            000000028|
|            000000038|
+---------------------+

The new DataFrame should look like this...

+---------------------+
|prty_tax_govt_issu_id|
+---------------------+
|            000000005|
|            000000028|
|            000000038|
+---------------------+

I'm not sure whether I need to do this after the data is stored in the DataFrame, or whether I can get that result directly in my select statement. Thanks :)

2 Answers:

Answer 0 (score: 1)

Count the number of rows per id, then keep only the ids whose count is 1.

val df = spark.sql("select prty_tax_govt_issu_id from CST_EQUIFAX.eqfx_prty_emp_incm_info where emp_mtch_cd = 'Y' and emp_mtch_actv_rcrd_in = 'Y' and emp_sts_in = 'A'")
// Get counts per id
val counts = df.groupBy("prty_tax_govt_issu_id").count()
// Filter for id's having only one row
// Filter for ids having only one row (note: === is the Column equality operator)
counts.filter($"count" === 1).select($"prty_tax_govt_issu_id").show()
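
If your real select pulls more columns than just the id and you need the complete records back, one option is to semi-join the unique ids onto the original DataFrame. A minimal sketch building on the counts and df above (a left_semi join returns only df's columns, for matching ids):

// Ids that occur exactly once
val uniqueIds = counts.filter($"count" === 1).select("prty_tax_govt_issu_id")
// Keep the full rows of df for exactly those ids
val result = df.join(uniqueIds, Seq("prty_tax_govt_issu_id"), "left_semi")
result.show()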

In SQL, you can do the same in one statement:

val df = spark.sql("""
                    select prty_tax_govt_issu_id 
                    from CST_EQUIFAX.eqfx_prty_emp_incm_info
                    where emp_mtch_cd = 'Y' and emp_mtch_actv_rcrd_in = 'Y' and emp_sts_in = 'A' 
                    group by prty_tax_govt_issu_id 
                    having count(*)=1
                   """)   
df.show() 
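
A window function is another way to express the same filter without a separate aggregation step; a rough sketch, assuming df is the DataFrame defined above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit}

// Every row in a partition sees the row count for its id
val w = Window.partitionBy("prty_tax_govt_issu_id")

val unique = df
  .withColumn("cnt", count(lit(1)).over(w)) // rows sharing an id share a count
  .filter($"cnt" === 1)                     // keep ids that appear exactly once
  .drop("cnt")

unique.show()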

Answer 1 (score: 0)

A GROUP BY clause together with a HAVING filter will do it:

select prty_tax_govt_issu_id 
from CST_EQUIFAX.eqfx_prty_emp_incm_info 
where emp_mtch_cd = 'Y' 
and emp_mtch_actv_rcrd_in = 'Y' 
and emp_sts_in = 'A'
GROUP BY prty_tax_govt_issu_id
HAVING COUNT(*) = 1
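
Note that GROUP BY on its own (like SELECT DISTINCT) would still return one 000000012 row; the HAVING filter is what drops ids that occur more than once. Since this is plain HiveQL, it should also run unchanged directly in Hive, outside of Spark.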