我在Hive 1.2.1中有以下数据(我的实际数据集要大得多,但数据结构相似):
id radar_id car_id datetime
1 A21 123 2017-03-08 17:31:19.0
2 A21 555 2017-03-08 17:32:00.0
3 A21 777 2017-03-08 17:33:00.0
4 B15 123 2017-03-08 17:35:22.0
5 B15 555 2017-03-08 17:34:05.0
6 B15 777 2017-03-08 20:50:12.0
7 C09 777 2017-03-08 20:55:00.0
8 A21 123 2017-03-09 11:00:00.0
9 C11 664 2017-03-09 11:10:00.0
10 A21 123 2017-03-09 11:12:00.0
11 A21 555 2017-03-09 11:12:10.0
12 B15 123 2017-03-09 11:14:00.0
13 B15 555 2017-03-09 11:20:00.0
14 A21 444 2017-03-09 10:00:00.0
15 C09 444 2017-03-09 10:20:00.0
16 B15 444 2017-03-09 10:05:00.0
我希望获得最常见的2条路线。路线是由radar_id
排序的datetime
序列。我想得到如下结果:
route frequency
A21->B15 2
A21->B15-C09 1
频率是每天通过路线的车辆(非唯一,无需考虑car_id
)的平均次数。
对于路线A21->B15
,频率为2,因为2017-03-08
上有3个游乐设施,2017-03-09
上有1个游乐设施。车辆123
在日期A21->A21->B15
上执行路线2017-03-09
非常重要。它与A21->B15
不同。所以,我想考虑从最初的雷达到白天捕获该车辆的最终雷达的路线。
乘车于23:55开始并于00:22结束的情况应被视为两条不同的路线。
如何使用Hive 1.2.1进行操作?
更新
根据答案中的建议,我测试了此查询,但route
不包含->
。路线的值类似于000021
或0450001
等。
df = sqlContext.sql("select regexp_replace(route,'(?<=^|->)\\d{5}','') as route " +
",count(*) / min(days) as frequency " +
"from (select concat_ws('->',sort_array(collect_list(radarids))) as route " +
",count(distinct dt) over() as days " +
"from (select car_id " +
",to_date(datetime) as dt " +
",concat(printf('%05d',row_number() over " +
"(partition by car_id,to_date(datetime) " +
"order by to_unix_timestamp(datetime))),cast(radarid as string)) as radarids " +
"from mytable " +
") t " +
"group by car_id " +
",dt " +
") t " +
"group by route " +
"order by frequency desc " +
"limit 5")
答案 0 :(得分:4)
select regexp_replace(route,'(?<=^|->)\\d{5}','') as route
,count(*) / min(days) as frequency
from (select concat_ws('->',sort_array(collect_list(radar_ids))) as route
,count(distinct dt) over() as days
from (select car_id
,to_date(datetime) as dt
,concat(printf('%05d',row_number() over (partition by car_id,to_date(datetime) order by datetime)),radar_id) as radar_ids
from mytable
) t
group by car_id
,dt
) t
group by route
order by frequency desc
limit 2
;
+---------------+-----------+
| route | frequency |
+---------------+-----------+
| A21->B15 | 1.5 |
+---------------+-----------+
| A21->B15->C09 | 1.0 |
+---------------+-----------+
答案 1 :(得分:1)
似乎从documentation HIVE不支持递归CTE,但幸运的是它支持子查询,group by
clasue,row_number
分析函数,trunc(string date, string format)
函数,{ {1}}函数和concat
子句
我无法访问Hive,但我可以展示如何在PostgreSQL上构建这样的查询,它们之间只有细微的差别所以我相信你设法重写它。我认为唯一要替换的是LIMIT x
函数来自postgres,来自Hive的date_trunc('day', datetim )
。
trunc(datetim , 'DD')
演示:http://sqlfiddle.com/#!15/53c7e/27
SELECT route, avg( cnt ) as average
FROM (
SELECT concat(route1, '>', route2, '>', route3, '>', route4) as Route,
count(*) as cnt
FROM (
SELECT date_trunc('day', datetim ) As datetim, car_id,
max( case when rn = 1 then radar_id end ) as route1,
max( case when rn = 2 then radar_id end ) as route2,
max( case when rn = 3 then radar_id end ) as route3,
max( case when rn = 4 then radar_id end ) as route4
/* max( case when rn = 5 then radar_id end ) as route5
......
max( case when rn = N then radar_id end ) as routeN */
FROM (
select t.*,
row_number() over (
partition by date_trunc('day', datetim ),car_id
order by datetim
) as rn
from table111 t
) x
GROUP BY date_trunc('day', datetim ), car_id
) x
group by concat(route1, '>', route2, '>', route3, '>', route4)
) x
GROUP BY route
order by avg( cnt ) desc
LIMIT 2
;