How to remove duplicate rows by count in Hive SQL?

Time: 2015-06-11 15:22:26

Tags: sql hive hiveql

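Some posts on Stack Overflow did help, but I still can't remove rows by count in Hive.

Apple has a row count of 2 for some dates. How can I keep only 1 row for Apple?

- What the data looks like ... 14 records in total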

customerID     date product_type            
1234abc       20140105  Orange      
1234abc       20140105  Apple       
1234abc       20140205  Orange      
1234abc       20140205  Apple       
1234abc       20140205  Apple       
1234abc       20140305  Orange      
1234abc       20140305  Apple       
1234abc       20140305  Apple       
1234abc       20140405  Orange      
1234abc       20140405  Apple       
1234abc       20140405  Apple       
1234abc       20140505  Orange      
1234abc       20140505  Apple       
1234abc       20140505  Apple       

- Final output: 10 records in total

customerID     date product_type    
1234abc       20140105  Orange      
1234abc       20140105  Apple       
1234abc       20140205  Orange      
1234abc       20140205  Apple       
1234abc       20140305  Orange      
1234abc       20140305  Apple       
1234abc       20140405  Orange      
1234abc       20140405  Apple       
1234abc       20140505  Orange      
1234abc       20140505  Apple       

2 Answers:

Answer 0: (score: 1)

SELECT DISTINCT customerID, date, product_type FROM your_table
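
The query above only returns de-duplicated rows; it does not change the stored table. One way to make the result permanent in Hive is to overwrite the table with its own distinct contents. This is a minimal sketch, assuming `your_table` (the placeholder name used in this answer) contains only the three columns shown in the question. `date` is backtick-quoted because newer Hive versions may treat it as a reserved keyword.

```sql
-- Assumption: your_table has exactly these three columns, in this order.
-- Rewrites the table so that each (customerID, date, product_type)
-- combination is stored once.
INSERT OVERWRITE TABLE your_table
SELECT DISTINCT customerID, `date`, product_type
FROM your_table;
```

With the sample data from the question, the four duplicated Apple rows collapse, which takes the 14 records down to the 10 shown in the expected output.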

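Answer 1: (score: 0)

I suggest a two-step approach. Step 1: create a temporary table and insert the list of duplicated records into it using INSERT and SELECT, as shown below: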

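-- T-SQL (SQL Server) syntax: #Temp tables and [bracket] quoting are not valid HiveQL.
-- CustomerID is varchar because the sample IDs (e.g. 1234abc) are not numeric.
CREATE TABLE #Temp (Product_Name char(30), [Date] date, CustomerID varchar(30));

-- Keep one copy of every (Date, Product_Name, CustomerID) group that occurs more than once.
INSERT INTO #Temp (Product_Name, [Date], CustomerID)
SELECT x.Product_Name, x.[Date], x.CustomerID
FROM (
    SELECT COUNT(*) AS dup
         , [Product_Name]
         , CustomerID
         , [Date]
    FROM dbo.[originaltable]
    GROUP BY [Date], [Product_Name], CustomerID
) x
WHERE x.dup > 1;

Then delete the duplicates with:

-- Remove every row that belongs to a duplicated group, matching on all three columns.
DELETE FROM dbo.[originaltable]
WHERE EXISTS (
    SELECT 1
    FROM #Temp t
    WHERE t.Product_Name = dbo.[originaltable].Product_Name
      AND t.[Date] = dbo.[originaltable].[Date]
      AND t.CustomerID = dbo.[originaltable].CustomerID
);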

Step 2: Insert the contents of the #Temp table back into the original table.
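
Since the question is about Hive, where `#Temp` tables do not exist and row-level `DELETE` is only available on transactional (ACID) tables, the same two-step idea can be sketched in HiveQL with a staging table. The names `your_table` and `your_table_dedup` are placeholders, and the sketch assumes the table has only the three columns from the question; treat it as an illustration rather than a tested recipe for any particular Hive version.

```sql
-- Optional check: list the groups that occur more than once.
SELECT customerID, `date`, product_type, COUNT(*) AS dup
FROM your_table
GROUP BY customerID, `date`, product_type
HAVING COUNT(*) > 1;

-- Step 1: build a staging table holding exactly one row per group.
-- ROW_NUMBER() numbers the rows inside each group; rn = 1 keeps one of them.
CREATE TABLE your_table_dedup AS
SELECT customerID, `date`, product_type
FROM (
    SELECT customerID, `date`, product_type,
           ROW_NUMBER() OVER (
               PARTITION BY customerID, `date`, product_type
               ORDER BY customerID
           ) AS rn
    FROM your_table
) ranked
WHERE rn = 1;

-- Step 2: replace the original contents with the de-duplicated rows.
INSERT OVERWRITE TABLE your_table
SELECT customerID, `date`, product_type
FROM your_table_dedup;
```

Because all columns take part in the `PARTITION BY`, the rows inside a group are identical and it does not matter which one `rn = 1` keeps. If the table had extra columns, the `ORDER BY` inside the window would decide which row survives.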