How to create a loop that works in Snowflake?

Time: 2019-12-04 11:46:24

Tags: python for-loop snowflake-data-warehouse snowsql

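I'm trying to create a for loop in Python that connects to Snowflake, because Snowflake doesn't support loops. I want to select a number of random rows from different AgeGroups, e.g. 1500 rows from age group "30-40", 1200 rows from age group "40-50", and 875 rows from age group "50-60".

Any ideas, or an alternative to loops in Snowflake?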

4 answers:

Answer 0 (score: 2)

Have you looked at Snowflake's stored procedures? They are written in JavaScript and let you loop natively inside Snowflake:

https://docs.snowflake.net/manuals/sql-reference/stored-procedures-overview.html

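Answer 1 (score: 0)

If you want to draw n random samples from each group, you can create a subquery that assigns a randomly ordered row number within each group, and then select the first n rows from each group.

If you have a table like this: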

USER    DATE
1       2018-11-04
1       2018-11-04
1       2018-12-07
1       2018-10-09
1       2018-10-09
1       2018-11-07
1       2018-11-09
1       2018-11-09
2       2019-11-02
2       2019-10-02
2       2019-11-03
2       2019-11-06
3       2019-11-10
3       2019-11-13
3       2019-11-15

This query returns two random rows each for users 2 and 3, and three random rows for user 1.

SELECT User, Date 
FROM (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY User ORDER BY RANDOM()) as random_row 
    FROM Users) 
WHERE 
    (User = 3 AND random_row < 3) OR 
    (User = 2 AND random_row < 3) OR 
    (User = 1 AND random_row < 4);

So in your case, partition by and filter on age_group instead of User.
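As a rough sketch of that adaptation, the per-group conditions could be generated from a dictionary of quotas instead of being typed by hand. The table name `people`, the column name `age_group`, and the helper `build_sample_query` are assumptions made for illustration; they do not come from the question.

```python
def build_sample_query(table, group_col, quotas):
    """Build a ROW_NUMBER-based query keeping at most quotas[g] random rows per group.

    table and group_col are interpolated as-is, so they must be trusted identifiers.
    """
    if not quotas:
        raise ValueError("quotas must not be empty")
    conditions = []
    for group, n in quotas.items():
        # Double any single quotes so the group label is a valid SQL string literal.
        label = str(group).replace("'", "''")
        conditions.append(f"({group_col} = '{label}' AND random_row <= {int(n)})")
    where = " OR\n    ".join(conditions)
    return (
        "SELECT *\n"
        "FROM (\n"
        f"    SELECT *, ROW_NUMBER() OVER (PARTITION BY {group_col} ORDER BY RANDOM()) AS random_row\n"
        f"    FROM {table})\n"
        "WHERE\n"
        f"    {where};"
    )


# Hypothetical table and column names; the quotas are the ones from the question.
query = build_sample_query(
    "people", "age_group", {"30-40": 1500, "40-50": 1200, "50-60": 875}
)
print(query)
```

The loop runs in Python only to assemble the text; Snowflake still receives a single statement.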

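Answer 2 (score: 0)

Snowflake supports random and deterministic table sampling. For example:

Return a sample of a table in which each row has a 10% probability of being included in the sample: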

SELECT * FROM testtable SAMPLE (10);

https://docs.snowflake.net/manuals/sql-reference/constructs/sample.html
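Because a percentage sample includes each row independently, the number of rows returned varies from run to run. The plain-Python sketch below imitates that behavior to show why a percentage alone does not give an exact row count. `bernoulli_sample` is a hypothetical helper and makes no claim about how Snowflake implements sampling internally.

```python
import random


def bernoulli_sample(rows, percent, seed=None):
    """Keep each row independently with probability percent / 100."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    rng = random.Random(seed)
    # rng.random() lies in [0.0, 1.0), so percent=0 keeps nothing
    # and percent=100 keeps every row.
    return [row for row in rows if rng.random() * 100 < percent]


rows = list(range(10_000))
# Five independent 10% samples of the same 10,000 rows.
sizes = [len(bernoulli_sample(rows, 10, seed=s)) for s in range(5)]
print(sizes)  # each size is near 1,000, but they need not be equal
```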

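Answer 3 (score: 0)

What do you mean by "Snowflake doesn't have loops"? It does have "loops", if you can find them in SQL...

The following query does what you asked for: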

WITH POPULATION AS ( /* 10,000 persons with random age 0-99 */
  SELECT 'Person ' || SEQ2() ID, ABS(RANDOM()) % 100 AGE
  FROM TABLE(GENERATOR(ROWCOUNT => 10000))
)
SELECT
  ID,
  AGE,
  CASE
    WHEN AGE < 30 THEN '0-30'
    WHEN AGE < 40 THEN '30-40'
    WHEN AGE < 50 THEN '40-50'
    WHEN AGE < 60 THEN '50-60'
    ELSE '60-100'
  END AGE_GROUP,
  ROW_NUMBER() OVER (PARTITION BY AGE_GROUP ORDER BY RANDOM()) DRAW_ORDER
FROM POPULATION
QUALIFY DRAW_ORDER <= DECODE(AGE_GROUP, '30-40', 1500, '40-50', 1200, '50-60', 875, 0);
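For readers who think in loops, here is a sketch of what the query computes, written as ordinary Python over an in-memory list. It only illustrates the logic (bucket by age, shuffle each bucket, keep rows up to the quota). The function names are hypothetical, and this is not how Snowflake executes the query.

```python
import random
from collections import defaultdict

QUOTAS = {"30-40": 1500, "40-50": 1200, "50-60": 875}  # from the question


def age_group(age):
    """Mirror the CASE expression in the query."""
    if age < 30:
        return "0-30"
    if age < 40:
        return "30-40"
    if age < 50:
        return "40-50"
    if age < 60:
        return "50-60"
    return "60-100"


def stratified_sample(people, quotas, seed=None):
    """Return up to quotas[group] randomly chosen (id, age) pairs per age group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for person in people:
        groups[age_group(person[1])].append(person)
    sample = []
    for group, members in groups.items():
        rng.shuffle(members)  # plays the role of ORDER BY RANDOM()
        # Groups without a quota contribute nothing, like the DECODE default of 0.
        sample.extend(members[: quotas.get(group, 0)])
    return sample


# Synthetic population: 10,000 people with a random age from 0 to 99.
rng = random.Random(0)
population = [(f"Person {i}", rng.randrange(100)) for i in range(10_000)]
picked = stratified_sample(population, QUOTAS, seed=1)
```

With 10,000 people spread over ages 0 to 99, each ten-year group holds roughly 1,000 rows, so a quota larger than the group simply returns the whole group, both here and in the query.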

Addendum

waldente pointed out that a simpler and more efficient approach is to use SAMPLE:

WITH
POPULATION_30_40 AS (SELECT * FROM POPULATION WHERE AGE >= 30 AND AGE < 40),
POPULATION_40_50 AS (SELECT * FROM POPULATION WHERE AGE >= 40 AND AGE < 50),
POPULATION_50_60 AS (SELECT * FROM POPULATION WHERE AGE >= 50 AND AGE < 60)
SELECT * FROM POPULATION_30_40 SAMPLE(1500 ROWS) UNION ALL 
SELECT * FROM POPULATION_40_50 SAMPLE(1200 ROWS) UNION ALL 
SELECT * FROM POPULATION_50_60 SAMPLE(875 ROWS)
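Since the question asked about a Python for loop, the same statement could be assembled in a loop and sent to Snowflake in one round trip. This sketch only builds the SQL text, following the shape of the query above. The helper `build_union_sample_query` is hypothetical, the defaults `POPULATION` and `AGE` are taken from the example, and executing the result over a Snowflake connection is left out.

```python
def build_union_sample_query(strata, table="POPULATION", age_col="AGE"):
    """Build a WITH ... UNION ALL query from (low, high, n_rows) tuples.

    Each branch draws n_rows rows from the rows whose age is in [low, high).
    """
    if not strata:
        raise ValueError("strata must not be empty")
    ctes, selects = [], []
    for low, high, n_rows in strata:
        # One CTE per age range, named like POPULATION_30_40.
        name = f"{table}_{int(low)}_{int(high)}"
        ctes.append(
            f"{name} AS (SELECT * FROM {table} "
            f"WHERE {age_col} >= {int(low)} AND {age_col} < {int(high)})"
        )
        selects.append(f"SELECT * FROM {name} SAMPLE({int(n_rows)} ROWS)")
    return "WITH\n" + ",\n".join(ctes) + "\n" + " UNION ALL\n".join(selects) + ";"


# Age ranges and row counts from the question.
sql = build_union_sample_query([(30, 40, 1500), (40, 50, 1200), (50, 60, 875)])
print(sql)
```

Adding another age group then means adding one tuple to the list rather than editing the statement by hand.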