Aggregating strings by group and in order in Hive and Presto

Time: 2017-04-24 17:54:00

Tags: hive hiveql presto

I have a table in the following format:

IDX   IDY   Time  Text
idx1  idy1  t1    text1
idx1  idy2  t2    text2
idx1  idy2  t3    text3
idx1  idy1  t4    text4
idx2  idy3  t5    text5
idx2  idy3  t6    text6
idx2  idy1  t7    text7
idx2  idy3  t8    text8

What I would like to see is this:

idx1  text1
idx1  text2, text3
idx1  text4
idx2  text5, text6
idx2  text7
idx2  text8

so that in the final stage I can arrive at:

text1
text2, text3
text4
==SEPERATOR==
text5, text6
text7
text8

How can I do this in Hive or Presto? Thanks.

1 answer:

Answer 0: (score: 1)

Hive

This is the basic query; you can take it from here if you like.

select  IDX
       ,IDY
       ,min(time)                           as from_time
       ,max(time)                           as to_time
       ,concat_ws(',',collect_list (Text))  as text

from   (select  *
               ,row_number () over 
                (
                    partition by    IDX
                    order by        Time
                )   as rn
               ,row_number () over 
                (
                    partition by    IDX,IDY
                    order by        Time
                )   as rn_IDY

        from    mytable
        ) t

group by    IDX,IDY
           ,rn - rn_IDY

order by    IDX,from_time
+------+------+-----------+---------+-------------+
| idx  | idy  | from_time | to_time |    text     |
+------+------+-----------+---------+-------------+
| idx1 | idy1 | t1        | t1      | text1       |
| idx1 | idy2 | t2        | t3      | text2,text3 |
| idx1 | idy1 | t4        | t4      | text4       |
| idx2 | idy3 | t5        | t6      | text5,text6 |
| idx2 | idy1 | t7        | t7      | text7       |
| idx2 | idy3 | t8        | t8      | text8       |
+------+------+-----------+---------+-------------+
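
Why grouping by rn - rn_IDY works: within one IDX, both row numbers grow by one on every consecutive row that keeps the same IDY, so their difference stays constant for that run and changes as soon as a different IDY interrupts it. The following is only a rough Python sketch of that logic, not part of the original answer. The function name `group_runs` is hypothetical, and the times t1..t8 are assumed to be the integers 1..8 so that they can be sorted.

```python
from collections import defaultdict

# Sample rows from the question: (IDX, IDY, Time, Text).
# Assumption: times t1..t8 are represented as the integers 1..8.
rows = [
    ("idx1", "idy1", 1, "text1"),
    ("idx1", "idy2", 2, "text2"),
    ("idx1", "idy2", 3, "text3"),
    ("idx1", "idy1", 4, "text4"),
    ("idx2", "idy3", 5, "text5"),
    ("idx2", "idy3", 6, "text6"),
    ("idx2", "idy1", 7, "text7"),
    ("idx2", "idy3", 8, "text8"),
]


def group_runs(rows):
    """Mimic the query: group consecutive rows that share IDX and IDY."""
    rn = defaultdict(int)      # row_number() over (partition by IDX order by Time)
    rn_idy = defaultdict(int)  # row_number() over (partition by IDX, IDY order by Time)
    groups = {}                # (IDX, IDY, rn - rn_IDY) -> list of (time, text)

    # Walking the rows in time order yields the same numbering as the window functions.
    for idx, idy, time, text in sorted(rows, key=lambda r: r[2]):
        rn[idx] += 1
        rn_idy[(idx, idy)] += 1
        key = (idx, idy, rn[idx] - rn_idy[(idx, idy)])
        groups.setdefault(key, []).append((time, text))

    # One output row per group: IDX, IDY, from_time, to_time, joined text.
    result = [
        (idx, idy, items[0][0], items[-1][0], ",".join(t for _, t in items))
        for (idx, idy, _), items in groups.items()
    ]
    result.sort(key=lambda r: (r[0], r[2]))  # order by IDX, from_time
    return result


for row in group_runs(rows):
    print(row)
```

For idx1 the differences come out as 0, 1, 1, 2, which is what splits text1, text2/text3 and text4 into three separate groups even though text1 and text4 share the same IDY.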

Presto

select  array_join(array_agg (Text),',')   as text

from   (select  *
               ,row_number () over 
                (
                    partition by    IDX
                    order by        Time
                )   as rn
               ,row_number () over 
                (
                    partition by    IDX,IDY
                    order by        Time
                )   as rn_IDY

        from    mytable
        ) t

group by    IDX,IDY
           ,rn - rn_IDY

order by    IDX,min(time)
;
+-------------+
|    text     |
+-------------+
| text1       |
| text2,text3 |
| text4       |
| text5,text6 |
| text7       |
| text8       |
+-------------+
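
The answer stops at the grouped rows and leaves the final stage from the question open. One possible way to finish, sketched here in Python on the client side rather than in SQL, is to join the grouped texts per IDX and put the separator between the IDX blocks. This is an assumption about the desired layout (one group per line, separator on its own line); the function name `render` is hypothetical, and the input is assumed to carry IDX alongside the text, already ordered by IDX and time.

```python
from itertools import groupby

# Assumed input: (IDX, text) pairs as produced by the grouped query,
# with IDX kept in the select list and rows already ordered by IDX and time.
grouped = [
    ("idx1", "text1"),
    ("idx1", "text2,text3"),
    ("idx1", "text4"),
    ("idx2", "text5,text6"),
    ("idx2", "text7"),
    ("idx2", "text8"),
]


def render(grouped, separator="==SEPERATOR=="):
    """Join texts line by line per IDX, with a separator line between IDX blocks."""
    blocks = []
    # groupby only merges adjacent rows, so the input must be sorted by IDX.
    for _, items in groupby(grouped, key=lambda r: r[0]):
        blocks.append("\n".join(text for _, text in items))
    return f"\n{separator}\n".join(blocks)


print(render(grouped))
```

The separator string is copied from the question as written; swap it for whatever token the downstream consumer expects.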