NTILE() in BigQuery for non-uniform buckets

Time: 2018-07-02 10:17:12

Tags: sql google-bigquery

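I am trying to perform RFM segmentation on the Google Merchandise Store sample dataset in BigQuery. In my SQL query, NTILE(5) divides the rows into 5 buckets based on row ordering and returns the bucket number assigned to each row. In this case, every bucket is the same size. I would like to know how to create buckets of different sizes: for example, bucket 1 contains the bottom 10% of records, bucket 2 the next 20%, and so on. Thanks!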

#standardSQL
  SELECT
      fullVisitorId,
      NTILE(5) OVER (ORDER BY last_order_date) AS rfm_recency,
      NTILE(5) OVER (ORDER BY count_order) AS rfm_frequency,
      NTILE(5) OVER (ORDER BY avg_amount) AS rfm_monetary
    FROM (
      SELECT
        fullVisitorId,
        MAX(date) AS last_order_date,
        COUNT(*) AS count_order,
        AVG(totals.totalTransactionRevenue)/1000000 AS avg_amount
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_20170*`
      WHERE
        _table_suffix BETWEEN "101"
        AND "801"
        AND totals.totalTransactionRevenue IS NOT NULL
      GROUP BY
        fullVisitorId )

2 answers:

Answer 0: (score: 2)

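You can define your own buckets using row_number() and count(*). Compare each row's position in the ordering against cumulative fractions of the total row count: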

SELECT fullVisitorId,
       (CASE WHEN seqnum_r <= 0.1 * cnt THEN 1
             WHEN seqnum_r <= 0.3 * cnt THEN 2
             ELSE 3
        END) as bin_r,
       . . .               
FROM (SELECT fullVisitorId,
             MAX(date) AS last_order_date,
             COUNT(*) AS count_order,
             (AVG(totals.totalTransactionRevenue) / 1000000) AS avg_amount,
             COUNT(*) OVER () as cnt,
             ROW_NUMBER() OVER (ORDER BY MAX(date)) as seqnum_r,
             ROW_NUMBER() OVER (ORDER BY COUNT(*)) as seqnum_f,
             ROW_NUMBER() OVER (ORDER BY AVG(totals.totalTransactionRevenue)) as seqnum_m
      FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170*`
      WHERE _table_suffix BETWEEN "101" AND "801" AND
            totals.totalTransactionRevenue IS NOT NULL
      GROUP BY fullVisitorId
     ) rfm
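
To sanity-check the cutoffs, here is a minimal sketch that runs the same CASE logic on toy data. The CTE names toy and ranked, the column customer_value, and the 20 generated rows are assumptions for illustration only, not part of the GA sample dataset.

```sql
#standardSQL
-- Toy data (hypothetical): 20 rows with values 1..20.
WITH toy AS (
  SELECT v AS customer_value
  FROM UNNEST(GENERATE_ARRAY(1, 20)) AS v
),
ranked AS (
  SELECT
    customer_value,
    -- Position of each row in ascending order, starting at 1.
    ROW_NUMBER() OVER (ORDER BY customer_value) AS seqnum,
    -- Total number of rows, repeated on every row.
    COUNT(*) OVER () AS cnt
  FROM toy
)
SELECT
  -- Cutoffs are cumulative: 10%, then 10% + 20% = 30%, then the rest.
  CASE
    WHEN seqnum <= 0.1 * cnt THEN 1
    WHEN seqnum <= 0.3 * cnt THEN 2
    ELSE 3
  END AS bucket_id,
  COUNT(*) AS rows_in_bucket
FROM ranked
GROUP BY bucket_id
ORDER BY bucket_id
```

With 20 rows, the three buckets should come back with 2, 4, and 14 rows. Note that the thresholds in the CASE are cumulative shares, so a bucket holding "the next 20%" ends at 0.3, not 0.2.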

Answer 1: (score: 1)

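The following is for BigQuery Standard SQL and assumes your initial query works for you. The SQL UDF NON_UNIFORM_BUCKET() does the trick: NTILE(10) splits the rows into deciles, and the function maps those ten deciles onto five buckets covering 10%, 20%, 30%, 10%, and 30% of the rows.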

#standardSQL
CREATE TEMP FUNCTION NON_UNIFORM_BUCKET(i INT64) AS (
  CASE 
    WHEN i = 1 THEN 1
    WHEN i IN (2, 3) THEN 2
    WHEN i IN (4, 5, 6) THEN 3
    WHEN i = 7 THEN 4
    ELSE 5
  END
);
  SELECT
      fullVisitorId,
      NON_UNIFORM_BUCKET(NTILE(10) OVER (ORDER BY last_order_date)) AS rfm_recency,
      NON_UNIFORM_BUCKET(NTILE(10) OVER (ORDER BY count_order)) AS rfm_frequency,
      NON_UNIFORM_BUCKET(NTILE(10) OVER (ORDER BY avg_amount)) AS rfm_monetary
    FROM (
      SELECT
        fullVisitorId,
        MAX(date) AS last_order_date,
        COUNT(*) AS count_order,
        AVG(totals.totalTransactionRevenue)/1000000 AS avg_amount
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_20170*`
      WHERE
        _table_suffix BETWEEN "101"
        AND "801"
        AND totals.totalTransactionRevenue IS NOT NULL
      GROUP BY
        fullVisitorId )
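
As a quick check of the resulting bucket sizes, the same UDF can be run on toy data. This is only a sketch: the CTE name toy, the column customer_value, and the 20 generated rows are assumptions for illustration.

```sql
#standardSQL
CREATE TEMP FUNCTION NON_UNIFORM_BUCKET(i INT64) AS (
  CASE
    WHEN i = 1 THEN 1
    WHEN i IN (2, 3) THEN 2
    WHEN i IN (4, 5, 6) THEN 3
    WHEN i = 7 THEN 4
    ELSE 5
  END
);
-- Toy data (hypothetical): 20 rows with values 1..20.
WITH toy AS (
  SELECT v AS customer_value
  FROM UNNEST(GENERATE_ARRAY(1, 20)) AS v
)
SELECT
  bucket_id,
  COUNT(*) AS rows_in_bucket
FROM (
  SELECT
    -- NTILE(10) assigns deciles 1..10; the UDF merges them into 5 buckets.
    NON_UNIFORM_BUCKET(NTILE(10) OVER (ORDER BY customer_value)) AS bucket_id
  FROM toy
)
GROUP BY bucket_id
ORDER BY bucket_id
```

With 20 rows each decile holds 2 rows, so buckets 1 through 5 should hold 2, 4, 6, 2, and 6 rows. Two limitations of this approach: bucket boundaries can only fall on multiples of 10%, and when the row count is not divisible by 10, NTILE gives the leftover rows to the lowest-numbered deciles, so the shares are approximate.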