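I am writing my SQL queries in Snowflake. We have a huge table with billions of records containing customer information. The goal is to pull a random sample and look at the distributions in R. Unfortunately, we cannot use a JDBC/ODBC connection from RStudio to the database; that is a constraint. So my only option is to pull an extract from Snowflake and import it into R.
The difficulty is that we have a column called CUSTOMER SEGMENT with nearly 24 unique values. The goal is to get a sample from each segment that represents a meaningful proportion of it. I tried the following query: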
SELECT DISTINCT *
FROM test sample(10)
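to get a random sample in which each row has a 10% probability of being selected. However, I do not get samples from every value of the customer segment. Is there any SQL command that can stratify the sample by customer segment? Thanks in advance.
Answer 0 (score: 1)
Another way to sample partitions of more nearly equal size is round-robin sampling: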
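select t.*
from (select t.*,
             -- random rank of each row within its segment
             row_number() over (partition by segment order by random()) as seqnum,
             -- total row count; not used below, but available for a percentage-based cutoff
             count(*) over () as cnt
      from test t
     ) t
where seqnum <= 20;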
The "20" means at most 20 rows per segment.
This can be modified for a percentage-based sample. It is not clear whether that is necessary.
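As a rough sketch of that percentage-based modification, the cutoff can be derived from each segment's own row count instead of a constant. The table `test` and column `segment` come from the query above; the 10% fraction, the `segment_cnt` alias, and the "at least one row per segment" floor are assumptions chosen for illustration.

```sql
-- Sketch: keep about 10% of every segment, and never fewer than one row.
select t.*
from (select t.*,
             -- random rank of each row within its segment
             row_number() over (partition by segment order by random()) as seqnum,
             -- number of rows in the row's own segment
             count(*) over (partition by segment) as segment_cnt
      from test t
     ) t
where seqnum <= greatest(1, ceil(segment_cnt * 0.10));
```

Because the cutoff is computed per segment, small segments still appear in the result, which a plain `sample(10)` does not guarantee. The output also carries the helper columns `seqnum` and `segment_cnt`; list the wanted columns explicitly if they should not be exported to R.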
Answer 1 (score: 0)
Below is a stratified sampling example in Snowflake (or SQL), based on:
https://en.wikipedia.org/wiki/Stratified_sampling
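It can return either a fixed number of rows or a percentage.
It has been compiled into a single query here. In our implementation we actually pre-create the ranked segment list (W0) as a temporary table, rather than running the same subquery multiple times.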
SELECT
    W1.Id,
    W1.EmploymentStatus,
    W1.Gender
FROM
    ( SELECT
          ID,
          Row_Number() OVER ( PARTITION BY COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') ORDER BY random() ) as iInternalRank,
          COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') as sInternalGroupVal
      FROM STAFF ) W0,
    STAFF W1 -- Linked back to the original table (inbound query)
WHERE (
        SELECT
            MAX(case when W2.sInternalGroupVal = W3.sInternalGroupVal then W3.iGroupSegmentVolume else 0 end ) -- This is where the magic happens...
        FROM (
               SELECT
                   ID,
                   Row_Number() OVER ( PARTITION BY COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') ORDER BY random() ) as iInternalRank,
                   COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') as sInternalGroupVal
               FROM STAFF ) W2,
             (SELECT
                  sInternalGroupVal,
                  COUNT(Id)*(40/iTotalPopulation::DOUBLE PRECISION) as iGroupSegmentVolume -- as a fixed volume (40 records)
                  --COUNT(Id)*((iTotalPopulation*(23/100.00))/iTotalPopulation::DOUBLE PRECISION) as iGroupSegmentVolume -- as a percentage (23% of overall population)
              FROM (SELECT
                        ID,
                        Row_Number() OVER ( PARTITION BY COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') ORDER BY random() ) as iInternalRank,
                        COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') as sInternalGroupVal
                    FROM STAFF ),
                   (SELECT
                        COUNT(Id) as iTotalPopulation
                    FROM ( SELECT
                               ID,
                               Row_Number() OVER ( PARTITION BY COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') ORDER BY random() ) as iInternalRank,
                               COALESCE(Gender, '') || COALESCE(EmploymentStatus,'') as sInternalGroupVal
                           FROM STAFF )
                   ) W4
              GROUP BY sInternalGroupVal, iTotalPopulation) W3
        WHERE ((W2.sInternalGroupVal = W0.sInternalGroupVal)) AND ((W2.sInternalGroupVal = W3.sInternalGroupVal))
      ) >= iInternalRank
  AND ((W1.Id = W0.Id));
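The temporary-table variant mentioned in this answer is not shown, so the following is only a sketch of how it might look, not the answerer's actual implementation. It assumes the same `STAFF` table and grouping expression as the query above. The `grouped` CTE and the `iGroupPopulation` column are names introduced here for illustration, and the group and total counts are computed with window functions instead of repeated subqueries.

```sql
-- Sketch: rank the rows once, store the result, then filter it.
CREATE OR REPLACE TEMPORARY TABLE W0 AS
WITH grouped AS (
    SELECT Id,
           EmploymentStatus,
           Gender,
           COALESCE(Gender, '') || COALESCE(EmploymentStatus, '') AS sInternalGroupVal
    FROM STAFF
)
SELECT
    grouped.*,
    -- random rank of each row within its group
    ROW_NUMBER() OVER (PARTITION BY sInternalGroupVal ORDER BY RANDOM()) AS iInternalRank,
    -- size of the row's group and of the whole table
    COUNT(*) OVER (PARTITION BY sInternalGroupVal) AS iGroupPopulation,
    COUNT(*) OVER () AS iTotalPopulation
FROM grouped;

SELECT Id, EmploymentStatus, Gender
FROM W0
WHERE iInternalRank <= iGroupPopulation * (40 / iTotalPopulation::DOUBLE);  -- fixed volume (about 40 records overall)
-- WHERE iInternalRank <= iGroupPopulation * (23 / 100.0);                  -- percentage (23% of each group)
```

Each group's share of the sample is proportional to its share of the population, so a group whose computed volume falls below 1 contributes no rows. Wrap the right-hand side in `GREATEST(1, ...)` if every group must be represented.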