Question

Given the following table:

transaction_id    user_id    product_id
             1         10            AA
             2         10            CC
             3         10            AA
             4         10            CC
             5         20            AA
             6         20            BB
             7         20            BB
             8         30            BB
             9         30            BB
            10         30            BB
            11         40            CC
            12         40            AA
            13         40            CC
            14         40            BB
            15         40            BB
            16         50            EE
            17         60            EE

and the following query:

select
  product_id,
  count(distinct user_id) as count_repeat_users
from
  product_usage_log
where
  (product_id, user_id) in (
    select
      product_id,
      user_id
    from (
      select
        product_id,
        user_id,
        count (distinct transaction_id) as transactions
      from
        product_usage_log
      group by
        product_id,
        user_id
    ) t
    where transactions >= 2
  )
group by product_id

the following result is returned:

product_id    count_repeat_users
        AA                     1
        BB                     3
        CC                     2

(note that 'EE' doesn't appear, as expected)

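The query above is meant to return, for each product, the number of users who made at least two transactions with it. It does that, but it relies on a multi-column subquery with the IN predicate. As far as I can tell, that functionality is not available in Presto (it has been under discussion for the last two years, but it is still not there).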

How can I replicate the result above when where (product_id, user_id) in (...) is not available?

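Note: I tried flattening the where condition into two consecutive conditions. The problem is that a condition on all columns matching within a single row then becomes a condition on each column matching within any row. In other words, a user-product pair now matches as long as the product appears somewhere in the subquery result and the user appears somewhere in it, not necessarily in the same row.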
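
For illustration, the flattened attempt described in the note probably looks something like the sketch below. This is an assumption, since the question does not show the exact query. Each IN is evaluated independently, so the pairing between product and user is lost.

```sql
-- Sketch of the flattened (incorrect) filter; the exact query is an assumption.
select
  product_id,
  count(distinct user_id) as count_repeat_users
from
  product_usage_log
where
  -- products that have at least one repeat user
  product_id in (
    select product_id
    from product_usage_log
    group by product_id, user_id
    having count(distinct transaction_id) >= 2
  )
  -- users who are a repeat user of at least one product (any product)
  and user_id in (
    select user_id
    from product_usage_log
    group by product_id, user_id
    having count(distinct transaction_id) >= 2
  )
group by product_id
```

On the sample data, users 20 and 40 each bought AA only once, but both pass the second filter because of their repeat purchases of other products, so AA is reported with 3 users instead of 1.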

So another way to phrase the question is: in Presto, how can I build a condition on several values that must occur in the SAME row of a subquery?

Answer 1

"How can I replicate the result above when (product_id, user_id) in (...) is not available?"

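It is available in Presto directly. You just need to wrap the values produced by the subquery in an anonymous ROW, so that they are effectively a single column.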

Tested with Presto 318:

presto:default> SELECT
             ->     x, y
             -> FROM (VALUES (1,2), (3,4), (5,6)) t(x, y)
             -> WHERE (x, y) IN (
             ->     SELECT (z, w)
             ->     FROM (VALUES (1,1), (3,4), (5,5)) u(z, w)
             -> );
 x | y
---+---
 3 | 4
(1 row)

Another example, using the tpch.tiny schema:

presto:tiny> SELECT orderkey
          -> FROM orders
          -> JOIN customer ON orders.custkey = customer.custkey
          -> WHERE (orderkey, nationkey) IN (
          ->     SELECT (suppkey, nationkey) FROM supplier
          -> );
 orderkey
----------
        3
(1 row)

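Note: I am not entirely sure this approach behaves correctly for NULLs. I assume that is not a problem for you, since your subquery should not produce NULL for product_id or user_id.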
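
Applied to the question's table, the same idea might look like the sketch below. It keeps the original query as is and only wraps the two subquery columns in parentheses so that they form a single ROW value. It has not been run against the asker's data, so treat it as an illustration of the technique above rather than a verified answer.

```sql
select
  product_id,
  count(distinct user_id) as count_repeat_users
from
  product_usage_log
where
  (product_id, user_id) in (
    -- the parentheses build one ROW-typed column instead of two columns
    select
      (product_id, user_id)
    from (
      select
        product_id,
        user_id,
        count(distinct transaction_id) as transactions
      from
        product_usage_log
      group by
        product_id,
        user_id
    ) t
    where transactions >= 2
  )
group by product_id
```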

Answer 2

You can use window functions. I think this will work:

select product_id, count(distinct user_id)
from (select pul.*,
             count(*) over (partition by product_id, user_id) as cnt
      from product_usage_log pul
     ) pul
where cnt >= 2
group by product_id;

Based on your sample data, I am guessing that transaction_id is unique. If it is not, use count(distinct transaction_id) in the subquery.
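
Some engines do not accept DISTINCT inside a window aggregate, and whether a given Presto version does is worth checking. If it does not, one possible sketch for the non-unique case is to deduplicate the rows first and keep count(*) in the window. The alias d is only illustrative, and the sketch assumes transaction_id is never NULL.

```sql
select product_id, count(distinct user_id) as count_repeat_users
from (select d.*,
             -- after deduplication, every row in a partition is a distinct transaction
             count(*) over (partition by product_id, user_id) as cnt
      from (select distinct transaction_id, user_id, product_id
            from product_usage_log
           ) d
     ) pul
where cnt >= 2
group by product_id;
```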

Answer 3

I don't see why WHERE ... IN ... is needed at all (at least not from the sample data).
You can get what you want without it:

select t.product_id, count(*) count_repeat_users
from (
  select user_id, product_id
  from product_usage_log  
  group by user_id, product_id
  having count(transaction_id) > 1
) as t
group by product_id

See the demo (it is for SQL Server, but since the code is standard SQL, it works for Presto too).
Result:

product_id | count_repeat_users
AA         |                  1
BB         |                  3
CC         |                  2
