Question

我正在尝试在Snowflake中运行以下查询，但是它失败，并显示Unsupported subquery type cannot be evaluated。该查询在其他SQL引擎（例如Postgresql和Presto）中有效，因此看来Snowflake不支持这种查询。

SELECT first_action.date, 
  DATEDIFF('day', first_action.date, returning_action.date) - 1 as diff, 
  APPROXIMATE_SIMILARITY(select MINHASH_COMBINE(value) from (select first_action.user_id_set as value union all select returning_action.user_id_set)) _set
  FROM (select cast(_time as date) as date, minhash(100, _user) as user_id_set from events group by 1) as first_action
  JOIN (select cast(_time as date) as date, minhash(100, _user) as user_id_set from events group by 1) as returning_action 
ON (first_action.date < returning_action.date AND dateadd(day, 14, first_action.date) >= returning_action.date)
group by 1,2

该查询是使用MinHash的典型同类查询。我们每天计算MinHash，然后加入接下来的14天并合并结果，最后计算最终结果。

由于MinHash没有线性MINHASH_COMBINE函数，因此我们必须对UNION全部使用子查询，以使其正常工作，但这也没有解决。：/

由于我们真的不知道任何解决方法，因此我们现在陷入困境。任何帮助表示赞赏！

Answer 1

不确定是否可行，请尝试使用first_action语句将returning_action和WITH分开：

WITH 
first_action as (
    SELECT 
        TRY_CAST(_time AS DATE) as date, 
        MINHASH(100, _user) as user_id_set 
    FROM events 
    GROUP BY 1
),
returning_action as (
    SELECT 
        TRY_CAST(_time AS DATE) as date, 
        MINHASH(100, _user) as user_id_set 
    FROM events 
    GROUP BY 1
),
SELECT 
  first_action.date, 
  DATEDIFF('day', fa.date, ra.date) - 1 as diff, 
  APPROXIMATE_SIMILARITY(
      SELECT MINHASH_COMBINE(value) 
      FROM (
          SELECT fa.user_id_set AS value FROM first_action fa
          UNION ALL  
          SELECT ra.user_id_set AS value FROM returning_action ra
      )
  ) _set
FROM first_action fa
JOIN returning_action ra
ON (fa.date < ra.date AND DATEADD(day, 14, fa.date) >= ra.date)
GROUP BY 1,2

Answer 2

So the major trick is that all these MINHASH_ functions window functions so you need to build a grouping key on the data.

So using this as my example data:

CREATE TABLE events(_user number, _time timestamp_ntz);
INSERT INTO events VALUES (1,'2019-03-01'),(1,'2019-03-05'),(1,'2019-03-10'),
    (1,'2019-03-14'),(1,'2019-03-15'),(1,'2019-03-16'),
    (2,'2019-03-01'),(2,'2019-03-05'),(2,'2019-03-11'),
    (2,'2019-03-15');

The first set is to get the 14 days of data for the COMBINE

WITH actions AS (
    SELECT _time::date as date
        ,dateadd(day, 14, date) as date14
        ,minhash(100, _user) as user_id_set
    FROM events
    GROUP BY 1
)
SELECT fa.date
    ,ARRAY_AGG(ra.date) WITHIN GROUP (ORDER BY ra.date)
    ,MINHASH_COMBINE(ra.user_id_set) AS sets
FROM actions AS fa
JOIN actions AS ra 
    ON (fa.date <= ra.date AND fa.date14 > ra.date) 
GROUP BY 1
ORDER BY 1;

this is similar to your code, but here I include on the RA the same day as FA. so I can group by FA.date, but have FA's data included. On the date range I was not sure if you are want 14 days later or 14 days worth of date. I assume the later, thus changed the end range termination.

Now we have the combined next 14 days of data for each day, we want to get the pairs of (in my code I do not put a max days of compare but just include all pairs). Now again the APPROXIMATE_SIMILARITY is a window function, so I build an Array which I will rip straight apart again, thus pivoting the data, which is what you're trying to do via the union all (this can bee seen in the pairs and unrolled CTE's)

WITH actions AS (
    SELECT _time::date AS date
        ,dateadd(day, 14, date) as date14
        ,minhash(100, _user) AS user_id_set
    FROM events
    GROUP BY 1
), combined AS (
    SELECT fa.date
        ,MINHASH_COMBINE(ra.user_id_set) AS sets
    FROM actions AS fa
    JOIN actions AS ra 
        ON fa.date <= ra.date AND fa.date14 > ra.date
    GROUP BY 1
), pairs AS (
    SELECT fa.date
        ,DATEDIFF('day', fa.date, ra.date) AS diff
        ,ARRAY_CONSTRUCT(fa.sets,ra.sets) AS comp_set
    FROM combined AS fa
    JOIN combined AS ra 
        ON fa.date < ra.date
), unrolled AS (
    SELECT date
        ,diff
        ,f.value AS sets
    FROM pairs p,
    LATERAL FLATTEN(input => p.comp_set) f
)
SELECT date
    ,diff
    ,APPROXIMATE_SIMILARITY(sets)
FROM unrolled
GROUP BY 1,2
ORDER BY 1,2;

and thus you get results for all days

DATE    DIFF    APPROXIMATE_SIMILARITY(SETS)
2019-03-01  4   1
2019-03-01  9   1
2019-03-01  10  1
2019-03-01  13  1
2019-03-01  14  1
2019-03-01  15  0.51
2019-03-05  5   1
2019-03-05  6   1
2019-03-05  9   1
2019-03-05  10  1
2019-03-05  11  0.51
2019-03-10  1   1
2019-03-10  4   1
2019-03-10  5   1
2019-03-10  6   0.51
2019-03-11  3   1
2019-03-11  4   1
2019-03-11  5   0.51
2019-03-14  1   1
2019-03-14  2   0.51
2019-03-15  1   0.51

Snowflake中的相关子查询不起作用

2 个答案: