Hive - Create a dataset that replaces all values with the most common value

Time: 2017-07-18 12:41:15

Tags: sql hadoop hive

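I need to create a dataset that contains the same rows as the source table, but with the date of birth (dob) replaced by the most common dob value found for that person. If there is a tie, the dob from the row with the most recent date should be used.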

Input

id  first_name  last_name  dob       date
---------------------------------------------
1   john        doe        06/11/85  01/01/17
2   john        doe        06/11/86  01/01/17
3   john        doe        06/11/86  01/01/17
4   jane        doh        01/06/87  01/01/17
5   jane        doh        01/01/80  01/02/17

Output

1 john doe 06/11/86 01/01/17
2 john doe 06/11/86 01/01/17
3 john doe 06/11/86 01/01/17
4 jane doh 01/01/80 01/01/17
5 jane doh 01/01/80 01/02/17

john doe is updated to 06/11/86 (the most common value). jane doh is updated to 01/01/80 (the tie-breaker).

My latest attempt is based on a similar example:

SELECT a.id, a.first_name, a.last_name, a.date, b.id  FROM 
(SELECT first_name, last_name,dob,count(*) FROM table group by first_name, last_name,dob having count(*) in 
(SELECT max(total) AS freq FROM 
(SELECT first_name, last_name, dob, count(*) AS total FROM table group by first_name, last_name, dob) 
AS test_temp group by first_name, last_name)
) a   join (select * FROM table) b on (a.id = b.id)

I don't want just a solution; I'd also like an explanation I can learn from.

2 answers:

Answer 0: (score: 0)

  SELECT a.id, a.first_name, a.last_name, b.dob, a.date
  FROM table a
  JOIN (SELECT DISTINCT id, first_name, last_name, dob, count(dob) AS cnt
        FROM table
        ORDER BY cnt DESC
        LIMIT 1) b
    ON (a.first_name=b.first_name) AND (a.last_name=b.last_name)

I would try this. I join the base table to a subselect that finds the most common dob: ORDER BY cnt DESC LIMIT 1 is meant to pick the row with max(count(dob)). I then join that dob onto every record that has the same first_name and last_name. I hope this helps.
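
Note that, as written, the subselect has no GROUP BY, and LIMIT 1 keeps a single row for the whole table rather than one row per person. Below is a minimal sketch of a per-person variant of the same join idea. It runs against an in-memory SQLite database through Python's sqlite3 module, purely as a stand-in for Hive so the example is self-contained (window functions need SQLite 3.25 or newer). The table name people, the aliases cnt, last_date and rn, and the ISO-formatted dates are illustrative assumptions, not part of the original post.

```python
import sqlite3

# Sample rows from the question; dates rewritten as YYYY-MM-DD (an assumption)
# so that plain string ordering matches chronological ordering.
rows = [
    (1, "john", "doe", "1985-06-11", "2017-01-01"),
    (2, "john", "doe", "1986-06-11", "2017-01-01"),
    (3, "john", "doe", "1986-06-11", "2017-01-01"),
    (4, "jane", "doh", "1987-01-06", "2017-01-01"),
    (5, "jane", "doh", "1980-01-01", "2017-01-02"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE people '
    '(id INTEGER, first_name TEXT, last_name TEXT, dob TEXT, "date" TEXT)'
)
conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?, ?)", rows)

query = """
SELECT a.id, a.first_name, a.last_name, b.dob, a."date"
FROM people a
JOIN (
    -- rank each person's dob values: most frequent first, then most recent date
    SELECT first_name, last_name, dob,
           row_number() OVER (
               PARTITION BY first_name, last_name
               ORDER BY cnt DESC, last_date DESC
           ) AS rn
    FROM (
        -- one row per (person, dob) with its frequency and latest record date
        SELECT first_name, last_name, dob,
               count(*) AS cnt, max("date") AS last_date
        FROM people
        GROUP BY first_name, last_name, dob
    ) AS counted
) AS b
  ON a.first_name = b.first_name
 AND a.last_name = b.last_name
 AND b.rn = 1
ORDER BY a.id
"""

joined = conn.execute(query).fetchall()
for row in joined:
    print(row)
```

The inner GROUP BY produces one candidate per person and dob, row_number() ranks those candidates within each person, and the join keeps only the top-ranked dob (rn = 1) for every original row.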

Answer 1: (score: 0)

You can use the first_value() function to assign the date of birth, instead of a JOIN:

  select t.id, t.first_name, t.last_name,
         -- first row per person after sorting: highest count, then most recent date
         first_value(dob) over (partition by first_name, last_name
                                order by dob_cnt desc, date desc
                                rows between unbounded preceding and current row
                               ) as dob_imputed,
         t.date
  from (select t.*,
               -- number of rows that share this person's dob value
               count(*) over (partition by first_name, last_name, dob) as dob_cnt
        from t
       ) t
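
To see why this works, it can help to run the query on the sample rows from the question. The following is a minimal sketch that does so with Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only to keep the example runnable; window functions need SQLite 3.25 or newer). The dates are rewritten in ISO format, and the subquery alias counted is an illustrative name.

```python
import sqlite3

# Sample rows from the question; dates rewritten as YYYY-MM-DD (an assumption)
# so that "order by date desc" on text sorts chronologically.
rows = [
    (1, "john", "doe", "1985-06-11", "2017-01-01"),
    (2, "john", "doe", "1986-06-11", "2017-01-01"),
    (3, "john", "doe", "1986-06-11", "2017-01-01"),
    (4, "jane", "doh", "1987-01-06", "2017-01-01"),
    (5, "jane", "doh", "1980-01-01", "2017-01-02"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE t '
    '(id INTEGER, first_name TEXT, last_name TEXT, dob TEXT, "date" TEXT)'
)
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", rows)

query = """
SELECT id, first_name, last_name,
       -- take the dob of the first row per person: highest count, then latest date
       first_value(dob) OVER (
           PARTITION BY first_name, last_name
           ORDER BY dob_cnt DESC, "date" DESC
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS dob_imputed,
       "date"
FROM (
    -- attach to every row the number of rows sharing its (person, dob) pair
    SELECT t.*,
           count(*) OVER (PARTITION BY first_name, last_name, dob) AS dob_cnt
    FROM t
) AS counted
ORDER BY id
"""

imputed = conn.execute(query).fetchall()
for row in imputed:
    print(row)
```

The inner count(*) over (...) keeps every source row and only annotates it with how often its dob occurs for that person, so no join back to the base table is needed. The outer first_value() then sorts each person's rows by that count and by date, and copies the dob of the top row onto all of that person's rows.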