Filtering out duplicate rows based on a subset of columns

Asked: 2015-05-27 23:07:48

Tags: hadoop hive hiveql

I have some data that looks like this:

ID,DateTime,Category,SubCategory
X01,2014-02-13T12:36:14,Clothes,Tshirts
X01,2014-02-13T12:37:16,Clothes,Tshirts
X01,2014-02-13T12:38:33,Shoes,Running
X02,2014-02-13T12:39:23,Shoes,Running
X02,2014-02-13T12:40:42,Books,Fiction
X02,2014-02-13T12:41:04,Books,Fiction

What I want is to keep only one instance of each data point (I don't care which instance):

ID,DateTime,Category,SubCategory
X01,2014-02-13T12:36:14,Clothes,Tshirts
X02,2014-02-13T12:39:23,Shoes,Running
X02,2014-02-13T12:40:42,Books,Fiction

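Unfortunately, according to the Hive Language Manual, Hive's DISTINCT applies to the entire select list rather than to a subset of columns, so something like this is not an option: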

SELECT DISTINCT(ID, SubCategory),
       DateTime,
       Category
FROM sometable

How can I get the second table above? Thanks in advance!

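1 Answer:

Answer 0 (score: 1)

The usual way to do this kind of thing in SQL is to group by the columns that define a duplicate and pick one value of the remaining column with an aggregate such as min: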

select ID, category, subcategory, min(datetime) datetime
from sometable
group by ID, category, subcategory
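
If you would rather keep one complete row per key instead of aggregating each leftover column separately, a window function is another option. The following is only a sketch: it assumes a Hive version with windowing support (0.11 or later) and that a duplicate is defined by the (ID, SubCategory) pair, as in the DISTINCT attempt in the question. The names `ranked` and `rn` are made up for illustration.

```sql
-- Sketch: keep one row per (ID, SubCategory) pair.
-- Assumption: duplicates are defined by ID and SubCategory only.
SELECT ID, DateTime, Category, SubCategory
FROM (
  SELECT ID, DateTime, Category, SubCategory,
         -- Number the rows inside each (ID, SubCategory) group, earliest first
         ROW_NUMBER() OVER (PARTITION BY ID, SubCategory
                            ORDER BY DateTime) AS rn
  FROM sometable
) ranked          -- hypothetical alias; Hive requires one on a subquery
WHERE rn = 1;     -- keep only the first row of each group
```

Since you don't care which instance survives, the ORDER BY column is arbitrary. Ordering by DateTime just makes the result deterministic and keeps the earliest row, which matches what min(datetime) does in the GROUP BY version.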