我有一个看起来像这样的表
usr_id query_ts
12345 2019/05/13 02:06
123444 2019/05/15 04:06
123444 2019/05/16 05:06
12345 2019/05/16 02:06
12345 2019/05/15 02:06
它包含用户ID和用户运行查询的时间。该表中的每个条目都代表该ID在给定的时间戳记下运行1个查询。
我正在尝试制作这个:
usr_id day_1 day_2 … day_30
12345 31 13 15
123444 23 41 14
我想显示每个ID在最近30天内每天运行的查询数量,如果当天没有运行查询,它将为0。
这是我提出的查询的一部分,
SELECT
t1.usr_id,
case when t1.count_day_1 is null then 0 else t1.count_day_1 end as day_1,
case when t2.count_day_2 is null then 0 else t2.count_day_2 end as day_2
FROM
(SELECT usr_id, DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) as day_1,
COUNT( DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd"))) as count_day_1
FROM db.table
WHERE
DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) = 1
AND
from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")
BETWEEN date_sub(from_unixtime(unix_timestamp()), 30)
AND from_unixtime(unix_timestamp())
GROUP BY usr_id, day_1) t1
LEFT JOIN
(SELECT usr_id, DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) as day_2,
COUNT( DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd"))) as count_day_2
FROM db.table
WHERE
DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) = 2
AND
from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")
BETWEEN date_sub(from_unixtime(unix_timestamp()), 30)
AND from_unixtime(unix_timestamp())
GROUP BY usr_id, day_2) t2
ON (t1.usr_id = t2.usr_id)
ORDER BY t1.usr_id;
这很好用,它显示前2天每天运行的查询数,并将NULL替换为0。
问题是要使它在所有30天内都能正常工作,我必须使用30个LEFT JOIN,这会在群集上拉出约400GB以上的内存。
有更简单的方法吗?
答案 0 :(得分:2)
尝试在不进行连接的情况下执行此操作,并在WHERE中使用current_date或current_timestamp常量,而不是unix_timestamp(),此函数不是确定性的,其值在查询执行范围内也不固定,因此可以防止适当的查询优化-从2.0开始,建议使用CURRENT_TIMESTAMP常量:
select usr_id,
nvl(count(case when from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "dd") = 1 then 1 end),0) as day_1,
nvl(count(case when from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "dd") = 2 then 1 end),0) as day_2
...
from db.table
WHERE
from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")
BETWEEN date_sub(current_date, 30) AND current_date)
group by usr_id