Question

Given the following source data (say the table is named user_activity):

+---------+-----------+------------+
| user_id | user_type | some_date  |
+---------+-----------+------------+
| 1       | a         | 2018-01-01 |
| 1       | a         | 2018-01-02 |
| 2       | a         | 2018-01-01 |
| 3       | a         | 2018-01-01 |
| 4       | b         | 2018-01-01 |
| 4       | b         | 2018-01-02 |
| 5       | b         | 2018-01-02 |
+---------+-----------+------------+

I would like to get the following result:

+-----------+------------+---------------------+
| user_type | user_count | average_daily_users |
+-----------+------------+---------------------+
| a         | 3          | 2                   |
| b         | 2          | 1.5                 |
+-----------+------------+---------------------+

using a single query on the same table, without multiple subqueries.

Using multiple queries, I can get:

For user_count:

select
  user_type,
  count(distinct user_id)
from user_activity
group by user_type

For average_daily_users:

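select
  user_type,
  avg(distinct_users) as average_daily_users
from (
  select
    user_type,  -- must be selected here so the outer query can group by it
    count(distinct user_id) as distinct_users
  from user_activity
  group by user_type, some_date
) daily  -- Hive requires an alias on a subquery in the from clause
group by user_type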
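
As a sanity check on the two-level aggregation above, here is a minimal sketch that runs it on the sample rows. It uses Python's built-in sqlite3 rather than Hive, which is an assumption made only so the example runs anywhere. It also assumes the inner query selects user_type and carries an alias (here the hypothetical name daily). The function name average_daily_users_nested is made up for illustration.

```python
import sqlite3

# Sample rows copied from the question's user_activity table.
ROWS = [
    (1, "a", "2018-01-01"),
    (1, "a", "2018-01-02"),
    (2, "a", "2018-01-01"),
    (3, "a", "2018-01-01"),
    (4, "b", "2018-01-01"),
    (4, "b", "2018-01-02"),
    (5, "b", "2018-01-02"),
]


def average_daily_users_nested():
    """Run the subquery-based average on an in-memory SQLite copy of the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table user_activity (user_id integer, user_type text, some_date text)"
    )
    conn.executemany("insert into user_activity values (?, ?, ?)", ROWS)
    query = """
        select user_type, avg(distinct_users) as average_daily_users
        from (
          -- one row per (user_type, day) with that day's distinct users
          select user_type, count(distinct user_id) as distinct_users
          from user_activity
          group by user_type, some_date
        ) daily
        group by user_type
        order by user_type
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(average_daily_users_nested())  # [('a', 2.0), ('b', 1.5)]
```

The inner query yields daily counts of 3 and 1 for type a and 1 and 2 for type b, so the averages match the desired 2 and 1.5.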

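But I can't seem to write a query that does both in one go. I am concerned about the performance impact of multiple subqueries on the same table (it would have to scan the table twice... right?). I have a fairly large data source and would like to minimize run time.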

Note: this question is tagged Hive because that is what I am using, but I think it is a general enough SQL question that I am not ruling out answers in other dialects.

Note 2: this question shares details with my other question about the partition by column of a window function (used to compute the average daily users column).

Answer 1

This should do what you want:

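select ua.user_type,
       count(distinct ua.user_id) as user_count,
       -- distinct (date, user) pairs divided by distinct dates = average daily users
       count(distinct ua.some_date || ':' || ua.user_id) / count(distinct ua.some_date) as average_daily_users
from user_activity ua
group by ua.user_type;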
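
Why this works: within one user_type, the number of distinct (date, user) pairs equals the sum of the daily distinct-user counts, and dividing by the number of distinct dates gives the average over days on which that type had any activity. The sketch below checks this on the question's sample rows. It runs on Python's built-in sqlite3 instead of Hive, which is an assumption made for portability. Because SQLite divides integers as integers, the sketch multiplies by 1.0; as far as I know Hive's / already returns a double, so that factor should not be needed there. The function name single_pass_stats is hypothetical.

```python
import sqlite3

# Sample rows copied from the question's user_activity table.
ROWS = [
    (1, "a", "2018-01-01"),
    (1, "a", "2018-01-02"),
    (2, "a", "2018-01-01"),
    (3, "a", "2018-01-01"),
    (4, "b", "2018-01-01"),
    (4, "b", "2018-01-02"),
    (5, "b", "2018-01-02"),
]


def single_pass_stats():
    """Compute user_count and average_daily_users with one group by."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table user_activity (user_id integer, user_type text, some_date text)"
    )
    conn.executemany("insert into user_activity values (?, ?, ?)", ROWS)
    query = """
        select ua.user_type,
               count(distinct ua.user_id) as user_count,
               -- 1.0 * forces floating-point division in SQLite
               1.0 * count(distinct ua.some_date || ':' || ua.user_id)
                   / count(distinct ua.some_date) as average_daily_users
        from user_activity ua
        group by ua.user_type
        order by ua.user_type
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(single_pass_stats())  # [('a', 3, 2.0), ('b', 2, 1.5)]
```

For type a there are 4 distinct (date, user) pairs over 2 dates, and for type b there are 3 pairs over 2 dates, which gives the 2 and 1.5 asked for in the question.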
