Question

I have the following data in Hive 1.2.1 (my actual dataset is much larger, but the data structure is similar):

id    radar_id     car_id     datetime
1     A21          123        2017-03-08 17:31:19.0
2     A21          555        2017-03-08 17:32:00.0
3     A21          777        2017-03-08 17:33:00.0
4     B15          123        2017-03-08 17:35:22.0
5     B15          555        2017-03-08 17:34:05.0
6     B15          777        2017-03-08 20:50:12.0
7     C09          777        2017-03-08 20:55:00.0
8     A21          123        2017-03-09 11:00:00.0
9     C11          664        2017-03-09 11:10:00.0
10    A21          123        2017-03-09 11:12:00.0
11    A21          555        2017-03-09 11:12:10.0
12    B15          123        2017-03-09 11:14:00.0
13    B15          555        2017-03-09 11:20:00.0
14    A21          444        2017-03-09 10:00:00.0
15    C09          444        2017-03-09 10:20:00.0
16    B15          444        2017-03-09 10:05:00.0

I want to get the 2 most frequent routes. A route is the sequence of radar_id values ordered by datetime. I would like to get a result like this:

route          frequency
A21->B15       2
A21->B15->C09  1

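The frequency is the average number of cars per day (non-unique, no need to consider car_id) that pass through the route. For the route A21->B15 the frequency is 2, because there are 3 rides on 2017-03-08 and 1 ride on 2017-03-09. It is important that car 123 takes the route A21->A21->B15 on 2017-03-09, which is not the same as A21->B15. So I want to consider the route from the first radar to the last radar that captured the car during the day.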

A ride that starts at 23:55 and ends at 00:22 should be treated as two different routes.
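To make the definition above concrete, here is a minimal plain-Python sketch (not Hive) that applies it to the sample rows: group detections by car and calendar day, order each group by time, join the radar ids with `->`, and divide each route's count by the number of distinct days. The function name `route_frequencies` and the `ROWS` list are illustrative names only. Splitting by calendar day also covers the midnight case, since a ride crossing 00:00 lands in two groups. On these rows the sketch gives 1.5 for A21->B15 and 1.0 for A21->B15->C09.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Sample rows from the question: (radar_id, car_id, datetime)
ROWS = [
    ("A21", 123, "2017-03-08 17:31:19"),
    ("A21", 555, "2017-03-08 17:32:00"),
    ("A21", 777, "2017-03-08 17:33:00"),
    ("B15", 123, "2017-03-08 17:35:22"),
    ("B15", 555, "2017-03-08 17:34:05"),
    ("B15", 777, "2017-03-08 20:50:12"),
    ("C09", 777, "2017-03-08 20:55:00"),
    ("A21", 123, "2017-03-09 11:00:00"),
    ("C11", 664, "2017-03-09 11:10:00"),
    ("A21", 123, "2017-03-09 11:12:00"),
    ("A21", 555, "2017-03-09 11:12:10"),
    ("B15", 123, "2017-03-09 11:14:00"),
    ("B15", 555, "2017-03-09 11:20:00"),
    ("A21", 444, "2017-03-09 10:00:00"),
    ("C09", 444, "2017-03-09 10:20:00"),
    ("B15", 444, "2017-03-09 10:05:00"),
]


def route_frequencies(rows):
    """Return {route: rides per day}, averaged over all days in the data."""
    trips = defaultdict(list)  # (car_id, day) -> [(timestamp, radar_id), ...]
    days = set()
    for radar_id, car_id, ts in rows:
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        trips[(car_id, when.date())].append((when, radar_id))
        days.add(when.date())
    # One route per car per day: radar ids joined in time order
    counts = Counter(
        "->".join(radar for _, radar in sorted(hits))
        for hits in trips.values()
    )
    return {route: n / len(days) for route, n in counts.items()}


freq = route_frequencies(ROWS)
# Two most frequent routes, highest frequency first
top2 = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top2)
```

A small reference like this can be handy for checking a Hive query's output on a sample before running it on the full table.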

How can I do this with Hive 1.2.1?

Update

Following the suggestion in the answer, I tested this query, but route does not contain ->. The route values look like 000021 or 0450001.

df = sqlContext.sql("select      regexp_replace(route,'(?<=^|->)\\d{5}','')  as route " +
                                      ",count(*) / min(days)                        as frequency " +

                           "from       (select      concat_ws('->',sort_array(collect_list(radarids))) as route " +
                                                  ",count(distinct dt) over()                           as days " +
                                       "from       (select  car_id " +
                                                  ",to_date(datetime)   as dt " +
                                                  ",concat(printf('%05d',row_number() over " +
                                                  "(partition by car_id,to_date(datetime) " +
                                                  "order by to_unix_timestamp(datetime))),cast(radarid as string)) as radarids " +
                                                  "from    mytable " +
                                                  ") t " +
                                       "group by    car_id " +
                                      ",dt " +
                                      ") t " +
                           "group by    route " +      
                           "order by    frequency desc " +
                           "limit       5")

Answer 1

select      regexp_replace(route,'(?<=^|->)\\d{5}','')  as route
           ,count(*) / min(days)                        as frequency

from       (select      concat_ws('->',sort_array(collect_list(radar_ids))) as route
                       ,count(distinct dt) over()                           as days
            from       (select  car_id
                               ,to_date(datetime)   as dt
                               ,concat(printf('%05d',row_number() over (partition by car_id,to_date(datetime) order by datetime)),radar_id) as radar_ids
                        from    mytable
                        ) t
            group by    car_id
                       ,dt
            ) t
group by    route          
order by    frequency desc
limit       2 
;

+---------------+-----------+
| route         | frequency |
+---------------+-----------+
| A21->B15      | 1.5       |
+---------------+-----------+
| A21->B15->C09 | 1.0       |
+---------------+-----------+
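The query above relies on one trick that is easy to miss: `collect_list` gives no ordering guarantee, so each radar id is prefixed with its zero-padded `row_number`, the prefixed strings are sorted with `sort_array`, and the prefixes are stripped afterwards with `regexp_replace`. The following plain-Python sketch mimics those steps for one car on one day. The function name `build_route` is illustrative, and because Python's `re` module rejects the variable-width lookbehind `(?<=^|->)`, the sketch substitutes a capturing group; that substitution is an assumption of this sketch and not part of the answer.

```python
import re


def build_route(hits):
    """hits: list of (timestamp_string, radar_id) for one car on one day."""
    # Rank detections by time, like row_number() over (... order by datetime)
    ordered = sorted(hits)
    # Prefix each radar id with its zero-padded rank, like printf('%05d', rn)
    tagged = ["%05d%s" % (rank, radar)
              for rank, (_, radar) in enumerate(ordered, 1)]
    # Shuffle the order on purpose (collect_list guarantees none), then
    # restore it with a plain string sort, as sort_array does
    joined = "->".join(sorted(reversed(tagged)))
    # Drop the 5-digit rank at the start and after every '->'
    return re.sub(r"(^|->)\d{5}", r"\1", joined)


print(build_route([
    ("2017-03-08 20:50:12", "B15"),
    ("2017-03-08 17:33:00", "A21"),
    ("2017-03-08 20:55:00", "C09"),
]))
```

Because exactly five digits are removed per element, the stripping step also behaves the same way when the radar ids themselves are purely numeric.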

Answer 2

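It seems from the documentation that Hive does not support recursive CTEs, but fortunately it does support subqueries, the group by clause, the row_number analytic function, the trunc(string date, string format) function, the LIMIT x clause, and the concat function. I don't have access to Hive, but I can show how to build such a query in PostgreSQL; there are only minor differences between the two, so I believe you will manage to rewrite it. I think the only thing that needs replacing is the date_trunc('day', datetim ) function from PostgreSQL, which in Hive becomes:

trunc(datetim , 'DD')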

Demo: http://sqlfiddle.com/#!15/53c7e/27

SELECT route, avg( cnt ) as average
FROM (
        SELECT concat(route1, '>', route2, '>', route3, '>', route4) as Route,
               count(*) as cnt
        FROM (
                SELECT date_trunc('day', datetim ) As datetim, car_id,
                    max( case when rn = 1 then radar_id end ) as route1,
                    max( case when rn = 2 then radar_id end ) as route2,
                    max( case when rn = 3 then radar_id end ) as route3,
                    max( case when rn = 4 then radar_id end ) as route4
                /*  max( case when rn = 5 then radar_id end ) as route5
                    ......
                    max( case when rn = N then radar_id end ) as routeN */
                FROM (
                    select t.*,
                           row_number() over (
                               partition by date_trunc('day', datetim ),car_id 
                               order by datetim 
                           ) as rn
                    from table111 t
                ) x
                GROUP BY date_trunc('day', datetim ), car_id
        ) x
        group by concat(route1, '>', route2, '>', route3, '>', route4)
) x
GROUP BY route
order by avg( cnt ) desc
LIMIT 2
;

How to get the most frequent routes from historical data?