I have the following transactions stored in BigQuery.
CUSID PID TID YYYYMMDD
A01 P01 001 2017-01-01
A02 P01 002 2017-02-25
A02 P02 002 2017-02-25
A03 P02 003 2017-03-01
A03 P02 004 2017-03-05
A03 P02 004 2017-03-05
A04 P01 005 2017-03-10
A04 P03 005 2017-03-10
A04 P03 006 2017-03-11
A04 P03 007 2017-03-15
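For a given product, I want to compute two things:
1) X: the number of customers who bought the product divided by the total number of customers
2) Y: the average gap in days between repeat purchases of the product, across all customers who bought it
The expected output is therefore the following table: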
CUSID PID TID YYYYMMDD X Y
A01 P01 001 2017-01-01 3/4 = 0.75 AVG(0,0,0) = N/A (P01 is not re-purchased by A01, A02, and A04)
A02 P01 002 2017-02-25 3/4 = 0.75 AVG(0,0,0) = N/A (P01 is not re-purchased by A01, A02, and A04)
A02 P02 002 2017-02-25 2/4 = 0.50 AVG(0,4) = 4 (P02 is not re-purchased by A02 but it is re-purchased by A03 for 4 days. Note: duplicated product in the same TID is excluded, e.g. TID = 004)
A03 P02 003 2017-03-01 2/4 = 0.50 AVG(0,4) = 4 (P02 is not re-purchased by A02 but it is re-purchased by A03 for 4 days. Note: duplicated product in the same TID is excluded, e.g. TID = 004)
A03 P02 004 2017-03-05 2/4 = 0.50 AVG(0,4) = 4 (P02 is not re-purchased by A02 but it is re-purchased by A03 for 4 days. Note: duplicated product in the same TID is excluded, e.g. TID = 004)
A03 P02 004 2017-03-05 2/4 = 0.50 AVG(0,4) = 4 (P02 is not re-purchased by A02 but it is re-purchased by A03 for 4 days. Note: duplicated product in the same TID is excluded, e.g. TID = 004)
A04 P01 005 2017-03-10 3/4 = 0.75 AVG(0,0,0) = N/A (P01 is not re-purchased by A01, A02, and A04)
A04 P03 005 2017-03-10 1/4 = 0.25 AVG(1,4) = 2.5 (P03 is repurchased by A04 for 1 and 4 day gaps)
A04 P03 006 2017-03-11 1/4 = 0.25 AVG(1,4) = 2.5 (P03 is repurchased by A04 for 1 and 4 day gaps)
A04 P03 007 2017-03-15 1/4 = 0.25 AVG(1,4) = 2.5 (P03 is repurchased by A04 for 1 and 4 day gaps)
Could you suggest how to do this?
Answer 0 (score: 2)
The query below does what you describe. It is written for BigQuery Standard SQL.
#standardSQL
WITH data AS (
SELECT 'A01' AS CUSID, 'P01' AS PID, '001' AS TID, DATE '2017-01-01' AS YYYYMMDD UNION ALL
SELECT 'A02', 'P01', '002', DATE '2017-02-25' UNION ALL SELECT 'A02', 'P02', '002', DATE '2017-02-25' UNION ALL SELECT 'A03', 'P02', '003', DATE '2017-03-01' UNION ALL SELECT 'A03', 'P02', '004', DATE '2017-03-05' UNION ALL
SELECT 'A03', 'P02', '004', DATE '2017-03-05' UNION ALL SELECT 'A04', 'P01', '005', DATE '2017-03-10' UNION ALL SELECT 'A04', 'P03', '005', DATE '2017-03-10' UNION ALL SELECT 'A04', 'P03', '006', DATE '2017-03-11' UNION ALL
SELECT 'A04', 'P03', '007', DATE '2017-03-15'
),
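-- X: share of all customers who bought each product
popularity AS (
  SELECT DISTINCT PID,
    COUNT(DISTINCT CUSID) OVER(PARTITION BY PID) / COUNT(DISTINCT CUSID) OVER() AS X
  FROM data
),
-- days since the same customer's previous purchase of the same product
gaps AS (
  SELECT CUSID, PID, TID, YYYYMMDD,
    DATE_DIFF(YYYYMMDD, LAG(YYYYMMDD) OVER(PARTITION BY CUSID, PID ORDER BY YYYYMMDD), DAY) AS gap
  FROM data
),
-- collapse duplicate rows per customer, product and date;
-- HAVING drops first purchases (NULL gap) and zero gaps
gaps_without_dups AS (
  SELECT CUSID, PID, YYYYMMDD,
    MAX(IFNULL(gap, 0)) AS gap
  FROM gaps
  GROUP BY CUSID, PID, YYYYMMDD
  HAVING gap > 0
),
-- Y: average repurchase gap per product
average_gaps AS (
  SELECT PID, AVG(gap) AS Y
  FROM gaps_without_dups
  GROUP BY PID
)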
SELECT CUSID, PID, TID, YYYYMMDD, X, Y
FROM data
LEFT JOIN popularity USING (PID)
LEFT JOIN average_gaps USING(PID)
-- ORDER BY TID, PID
The output is as expected:
CUSID PID TID YYYYMMDD X Y
A01 P01 001 2017-01-01 0.75 null
A02 P01 002 2017-02-25 0.75 null
A02 P02 002 2017-02-25 0.5 4.0
A03 P02 003 2017-03-01 0.5 4.0
A03 P02 004 2017-03-05 0.5 4.0
A03 P02 004 2017-03-05 0.5 4.0
A04 P01 005 2017-03-10 0.75 null
A04 P03 005 2017-03-10 0.25 2.5
A04 P03 006 2017-03-11 0.25 2.5
A04 P03 007 2017-03-15 0.25 2.5
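The duplicate handling above (gaps_without_dups with HAVING gap > 0) can also be approached by deduplicating before computing gaps. The following is only a sketch of that alternative, not the answerer's method; the CTE name purchases is made up for illustration, and it computes Y alone. Because AVG ignores NULLs, the NULL gap of each first purchase drops out without an explicit filter.

```sql
#standardSQL
WITH data AS (
  SELECT 'A01' AS CUSID, 'P01' AS PID, '001' AS TID, DATE '2017-01-01' AS YYYYMMDD UNION ALL
  SELECT 'A02', 'P01', '002', DATE '2017-02-25' UNION ALL
  SELECT 'A02', 'P02', '002', DATE '2017-02-25' UNION ALL
  SELECT 'A03', 'P02', '003', DATE '2017-03-01' UNION ALL
  SELECT 'A03', 'P02', '004', DATE '2017-03-05' UNION ALL
  SELECT 'A03', 'P02', '004', DATE '2017-03-05' UNION ALL
  SELECT 'A04', 'P01', '005', DATE '2017-03-10' UNION ALL
  SELECT 'A04', 'P03', '005', DATE '2017-03-10' UNION ALL
  SELECT 'A04', 'P03', '006', DATE '2017-03-11' UNION ALL
  SELECT 'A04', 'P03', '007', DATE '2017-03-15'
),
-- Hypothetical helper CTE: one row per customer, product and purchase date,
-- which removes the repeated TID 004 row.
purchases AS (
  SELECT DISTINCT CUSID, PID, YYYYMMDD
  FROM data
),
-- Days since the same customer's previous purchase of the same product;
-- NULL for a customer's first purchase of that product.
gaps AS (
  SELECT
    PID,
    DATE_DIFF(YYYYMMDD, LAG(YYYYMMDD) OVER(PARTITION BY CUSID, PID ORDER BY YYYYMMDD), DAY) AS gap
  FROM purchases
)
-- AVG skips NULL gaps, so a product with no repurchases gets a NULL Y.
SELECT PID, AVG(gap) AS Y
FROM gaps
GROUP BY PID
```

On the sample data this should give a NULL Y for P01, 4.0 for P02 and 2.5 for P03, matching the Y column above. Like the original, it treats several purchases by one customer on the same date as a single purchase.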
Answer 1 (score: 1)
This query also does the job (I am assuming that if one transaction id is larger than another, its date is later as well):
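SELECT
  * EXCEPT(lead_date),
  -- Y: average gap in days to the same customer's next purchase of the product;
  -- rows whose next purchase is on the same date (duplicates) are skipped
  AVG(CASE WHEN lead_date != date THEN DATE_DIFF(PARSE_DATE("%Y-%m-%d", lead_date), PARSE_DATE("%Y-%m-%d", date), DAY) END) OVER(PARTITION BY PID) AS Y
FROM (
  SELECT
    *,
    -- X: share of all customers who bought this product
    COUNT(DISTINCT cusid) OVER(PARTITION BY PID) / COUNT(DISTINCT cusid) OVER() AS X,
    -- date of the same customer's next purchase of this product, in TID order
    LEAD(date) OVER(PARTITION BY PID, cusid ORDER BY TID) AS lead_date
  FROM
    data
)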
where data is your input data, defined in a WITH clause placed before the query:
WITH data AS (
  SELECT 'A01' AS cusid, 'P01' AS PID, 1 AS TID, '2017-01-01' AS date UNION ALL
  SELECT 'A02', 'P01', 2, '2017-02-25' UNION ALL
  SELECT 'A02', 'P02', 2, '2017-02-25' UNION ALL
  SELECT 'A03', 'P02', 3, '2017-03-01' UNION ALL
  SELECT 'A03', 'P02', 4, '2017-03-05' UNION ALL
  SELECT 'A03', 'P02', 4, '2017-03-05' UNION ALL
  SELECT 'A04', 'P01', 5, '2017-03-10' UNION ALL
  SELECT 'A04', 'P03', 5, '2017-03-10' UNION ALL
  SELECT 'A04', 'P03', 6, '2017-03-11' UNION ALL
  SELECT 'A04', 'P03', 7, '2017-03-15'
)
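Since this answer orders purchases by TID rather than by date, it may be worth checking the stated assumption on your own table first. The query below is a sketch of one such check and is not part of the original answer; the column name violations is made up for illustration. It counts pairs of rows where a smaller TID carries a later date. The dates are ISO-formatted strings, so comparing them as strings orders them chronologically.

```sql
#standardSQL
WITH data AS (
  SELECT 'A01' AS cusid, 'P01' AS PID, 1 AS TID, '2017-01-01' AS date UNION ALL
  SELECT 'A02', 'P01', 2, '2017-02-25' UNION ALL
  SELECT 'A02', 'P02', 2, '2017-02-25' UNION ALL
  SELECT 'A03', 'P02', 3, '2017-03-01' UNION ALL
  SELECT 'A03', 'P02', 4, '2017-03-05' UNION ALL
  SELECT 'A03', 'P02', 4, '2017-03-05' UNION ALL
  SELECT 'A04', 'P01', 5, '2017-03-10' UNION ALL
  SELECT 'A04', 'P03', 5, '2017-03-10' UNION ALL
  SELECT 'A04', 'P03', 6, '2017-03-11' UNION ALL
  SELECT 'A04', 'P03', 7, '2017-03-15'
)
-- Count row pairs that break the "larger TID implies later date" assumption.
SELECT COUNT(*) AS violations
FROM data AS a
JOIN data AS b
  ON a.TID < b.TID      -- a was recorded under a smaller transaction id
 AND a.date > b.date    -- yet carries a later date
```

For the sample data the count should be 0. The check is stricter than the query needs, because LEAD only compares rows within the same PID and cusid; a non-zero count means the ordering deserves a closer look, not necessarily that Y is wrong. The self-join is quadratic, so on a large table you may want to restrict it to one product or customer at a time.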
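When running this kind of analysis, make sure to use analytic (window) functions.
The BigQuery documentation is very good, and you can rely on it to learn how they work. They let you express quite complex analyses as simpler, faster queries.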
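As a minimal illustration of what an analytic function does, the sketch below uses three rows from the sample data; the output column names purchases and first_purchase are made up for illustration. Unlike GROUP BY, the OVER clause keeps every input row and attaches the value computed over that row's partition.

```sql
#standardSQL
WITH data AS (
  SELECT 'A04' AS CUSID, 'P03' AS PID, DATE '2017-03-10' AS YYYYMMDD UNION ALL
  SELECT 'A04', 'P03', DATE '2017-03-11' UNION ALL
  SELECT 'A04', 'P03', DATE '2017-03-15'
)
SELECT
  CUSID, PID, YYYYMMDD,
  -- Number of rows in this customer/product partition, repeated on each row.
  COUNT(*) OVER(PARTITION BY CUSID, PID) AS purchases,
  -- Earliest purchase date in the partition, repeated on each row.
  MIN(YYYYMMDD) OVER(PARTITION BY CUSID, PID) AS first_purchase
FROM data
```

All three rows come back, each with purchases = 3 and first_purchase = 2017-03-10, which is why both answers can attach X and Y to every transaction without a separate aggregation step per row.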