How to perform stratified sampling based on a column in Snowflake

Date: 2019-10-18 16:19:36

Tags: sql r snowflake-data-warehouse

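I am writing my SQL queries in Snowflake. We have a huge table with billions of records containing customer information. The goal is to take a random sample and examine the distributions in R. Unfortunately, we cannot use a JDBC/ODBC connection from RStudio to the database; that is a restriction. So my only option is to pull an extract from Snowflake and import it into R.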

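The difficulty is that we have a column named CUSTOMER SEGMENT with nearly 24 unique values. The goal is to get a sample from every segment that represents a meaningful proportion of that segment. I tried the following query: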

SELECT DISTINCT *
FROM test sample(10)

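This gives a random sample in which each row has a 10% probability of being selected, but I am not getting rows from every value of the customer segment. Is there a SQL command that can stratify the sample by customer segment? Thanks in advance.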

2 answers:

Answer 0: (score: 1)

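An alternative way to sample more equally sized partitions is round-robin sampling: number the rows within each segment in random order and keep the first few of each.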

select t.*
from (select t.*, 
             row_number() over (partition by segment order by random()) as seqnum,
             count(*) over () as cnt
      from test t
     ) t
where seqnum <= 20;

The "20" means at most 20 rows per segment.

This can be modified for a percentage-based sample. It is not clear whether that is necessary.
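For reference, a percentage-based variant might look like the sketch below. It assumes the same test table and segment column as the query above; the segment_cnt alias and the 10% figure are illustrative choices, not part of the original answer. Note that the count is taken per segment rather than over the whole table, so each segment contributes the same fraction of its own rows.

```sql
-- Sketch: keep roughly 10% of each segment, and at least one row per segment.
select t.*
from (select t.*,
             -- random ordering of the rows inside each segment
             row_number() over (partition by segment order by random()) as seqnum,
             -- size of the segment this row belongs to (segment_cnt is an assumed alias)
             count(*) over (partition by segment) as segment_cnt
      from test t
     ) t
where seqnum <= ceil(segment_cnt * 0.10);
```

Using ceil means that even a very small segment contributes one row, which addresses the original concern that some segments were missing from the sample.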

Answer 1: (score: 0)

Here is an example of stratified sampling in Snowflake (or SQL in general), based on the following:

https://en.wikipedia.org/wiki/Stratified_sampling

The sample size can be specified either as a fixed number of rows or as a percentage of the population.

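It has been compiled into a single query here. In our implementation we actually pre-create the ranked segment list (W0) as a temporary table instead of running the same subquery multiple times.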

SELECT 
    W1.Id, 
    W1.EmploymentStatus,    
    W1.Gender
    FROM 
        ( SELECT 
            ID, 
            Row_Number() OVER ( PARTITION BY COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') ORDER BY random() ) as iInternalRank, 
            COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') as sInternalGroupVal 
            FROM STAFF ) W0,
        STAFF W1    -- Linked back the original table (Inbound query)
    WHERE (
            SELECT 
                    MAX(case when W2.sInternalGroupVal = W3.sInternalGroupVal then W3.iGroupSegmentVolume else 0 end ) -- This is where the magic happens...
                    FROM ( 
                            SELECT 
                                ID, 
                                Row_Number() OVER ( PARTITION BY COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') ORDER BY random() ) as iInternalRank, 
                                COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') as sInternalGroupVal 
                                FROM STAFF ) W2, 
                        (SELECT 
                            sInternalGroupVal, 
                            COUNT(Id)*(40/iTotalPopulation::DOUBLE PRECISION) as iGroupSegmentVolume -- as a fixed volume (40 records)
                            --COUNT(Id)*((iTotalPopulation*(23/100.00))/iTotalPopulation::DOUBLE PRECISION) as iGroupSegmentVolume -- as a percentage (23% of overall population)
                            FROM (SELECT 
                                    ID, 
                                    Row_Number() OVER ( PARTITION BY COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') ORDER BY random() ) as iInternalRank, 
                                    COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') as sInternalGroupVal 
                                    FROM STAFF ), 
                                 (SELECT 
                                    COUNT(Id) as iTotalPopulation 
                                    FROM ( SELECT 
                                            ID, 
                                            Row_Number() OVER ( PARTITION BY COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') ORDER BY random() ) as iInternalRank, 
                                            COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') as sInternalGroupVal 
                                            FROM STAFF )
                                         ) W4
                            GROUP BY sInternalGroupVal, iTotalPopulation) W3
                    WHERE ((W2.sInternalGroupVal = W0.sInternalGroupVal)) AND ((W2.sInternalGroupVal = W3.sInternalGroupVal))
            ) >= iInternalRank 
            AND ((W1.Id = W0.Id));
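
The temporary-table approach mentioned above could be sketched roughly as follows. This is an assumption about how it might be done, not the answerer's actual implementation: STAFF_RANKED, iGroupSize and the use of window counts are hypothetical, while the STAFF table, its columns, and the 40-record target come from the query above.

```sql
-- Sketch: materialize the ranked list once, so the random order is computed
-- a single time and every later reference sees the same ranking.
CREATE TEMPORARY TABLE STAFF_RANKED AS          -- hypothetical table name
SELECT
    Id,
    COALESCE(Gender, '') || COALESCE(EmploymentStatus, '') AS sInternalGroupVal,
    -- random rank of each row inside its Gender/EmploymentStatus group
    ROW_NUMBER() OVER (
        PARTITION BY COALESCE(Gender, '') || COALESCE(EmploymentStatus, '')
        ORDER BY RANDOM()
    ) AS iInternalRank,
    -- number of rows in the group (hypothetical alias)
    COUNT(*) OVER (
        PARTITION BY COALESCE(Gender, '') || COALESCE(EmploymentStatus, '')
    ) AS iGroupSize,
    -- number of rows in the whole table
    COUNT(*) OVER () AS iTotalPopulation
FROM STAFF;

-- Each group keeps its proportional share of a 40-record sample.
-- For a percentage instead, replace the right-hand side with
-- R.iGroupSize * (23 / 100.0) to take 23% of every group.
SELECT S.Id, S.EmploymentStatus, S.Gender
FROM STAFF S
JOIN STAFF_RANKED R ON R.Id = S.Id
WHERE R.iInternalRank <= R.iGroupSize * (40 / R.iTotalPopulation::DOUBLE PRECISION);
```

Because the share is not rounded, the total returned can differ slightly from 40, and groups whose share is below one row return nothing; wrapping the right-hand side in CEIL would guarantee at least one row per group at the cost of a slightly larger sample.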