hive group by包含重复结果

时间:2016-07-01 15:52:09

标签: hadoop hive

我的hql是:

select day,app_id,platform,count(1) as dau from (select ad.day,ads.app_id,ads.platform,ad.deviceid from mobile_ad_space ads inner join (select day,deviceid,flightid from mobile_day_adlog where day='$STAT_DAY') ad on cast(ads.space_id as string)=ad.flightid   group by ad.day,ads.app_id,ads.platform,ad.deviceid) day_active group by day,app_id,platform having dau>5;

但结果包含重复数据:

| day        | appid | platform | dau|

| 2016-06-29 |   1 | ios      | 70533 |

| 2016-06-29 |   1 | android  | 49307 |

| 2016-06-29 |   1 | android  | 49307 |

| 2016-06-29 |   1 | android  | 49307 |

我也尝试了其他的hql:

SELECT day_active.day  ,day_active.app_id  ,day_active.platform  ,count(1) AS dau FROM (  SELECT day_device.day AS day   ,day_device.app_id AS app_id   ,day_device.platform AS platform   ,day_device.deviceid AS deviceid  FROM     (SELECT ad.day AS day   ,ads.app_id AS app_id   ,ads.platform AS platform   ,ad.deviceid AS deviceid  FROM mobile_ad_space ads  INNER JOIN (   SELECT day    ,deviceid    ,flightid   FROM mobile_day_adlog   WHERE day = '$STAT_DAY'   ) ad ON cast(ads.space_id AS string) = ad.flightid) day_device                GROUP BY day_device.day   ,day_device.app_id   ,day_device.platform   ,day_device.deviceid  ) day_active GROUP BY day_active.day  ,day_active.app_id  ,day_active.platform HAVING dau>5 ORDER BY day_active.day  ,day_active.app_id  ,day_active.platform;

但它仍有这个问题

有谁可以帮助我?

2 个答案:

答案 0 :(得分:0)

尝试检查值是否包含空格,在group by中使用trim()。还要按列检查每个组的不同值的数量:select distinct platform ...,依此类推

答案 1 :(得分:0)

最后,我找到了解决方案。

在我的HQL中,我使用insert overwrite directory '$HQL_OUT_PATH'覆盖输出。但它似乎不稳定导致重复项。因此,我在HQL之前清理输出路径,结果现在变成了〜 !