For example, my order data comes from customers like this:
test = spark.createDataFrame([
(0, 1, 1, "2018-06-03"),
(1, 1, 1, "2018-06-04"),
(2, 1, 3, "2018-06-04"),
(3, 1, 2, "2018-06-05"),
(4, 1, 1, "2018-06-06"),
(5, 2, 3, "2018-06-01"),
(6, 2, 1, "2018-06-01"),
(7, 3, 1, "2018-06-02"),
(8, 3, 1, "2018-06-02"),
(9, 3, 1, "2018-06-05")
])\
.toDF("order_id", "customer_id", "order_status", "created_at")
test.show()
Each order has its own status: 1 means newly created but not yet completed, and 3 means paid and completed.
Now I want to analyze where each order comes from, so I want to add a column to the data above. The logic, applied per customer, is: every order created before the first order with status 3 (including that order itself) is counted as an order from a new customer, and every order after that is counted as an order from an old customer.
Or in other words: for each customer's orders, sorted by date ascending, pick the rows up to the first occurrence of the status value 3.
How can I do this in SQL?
I searched around but didn't find a good solution. If I were using Python, I suppose I would write some kind of loop to pick out the values.
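For reference, the loop-based approach hinted at above can be sketched in plain Python (my own sketch, not from the answers; `orders` mirrors the sample data, and `label_orders` is a hypothetical helper name):

```python
# Plain-Python sketch of the per-customer "loop" idea: label every order
# up to and including the customer's first status-3 order as 'new',
# and everything after it as 'old'.
orders = [
    (0, 1, 1, "2018-06-03"), (1, 1, 1, "2018-06-04"),
    (2, 1, 3, "2018-06-04"), (3, 1, 2, "2018-06-05"),
    (4, 1, 1, "2018-06-06"), (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"), (7, 3, 1, "2018-06-02"),
    (8, 3, 1, "2018-06-02"), (9, 3, 1, "2018-06-05"),
]

def label_orders(rows):
    labeled = {}
    # Group rows by customer, sorted by created_at (ties broken by order_id).
    by_customer = {}
    for row in sorted(rows, key=lambda r: (r[3], r[0])):
        by_customer.setdefault(row[1], []).append(row)
    for cust_rows in by_customer.values():
        seen_completed = False
        for order_id, _, status, _ in cust_rows:
            labeled[order_id] = "old" if seen_completed else "new"
            if status == 3:
                seen_completed = True
    return labeled

print(label_orders(orders))
```

Customers with no status-3 order (customer 3 here) never flip the flag, so all their orders stay 'new' under this sketch.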
Answer 0 (score: 0):
Tested against SQLite:
SELECT order_id, customer_id, order_status, created_at,
CASE
WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'
ELSE 'new'
END AS customer_status
FROM orders o
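As a quick sanity check, the query above can be run on the sample data with Python's built-in sqlite3 module (a minimal sketch; the `orders` table name comes from the query itself):

```python
import sqlite3

# Load the sample orders into an in-memory SQLite database and run the
# correlated-subquery answer against it.
rows = [
    (0, 1, 1, "2018-06-03"), (1, 1, 1, "2018-06-04"),
    (2, 1, 3, "2018-06-04"), (3, 1, 2, "2018-06-05"),
    (4, 1, 1, "2018-06-06"), (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"), (7, 3, 1, "2018-06-02"),
    (8, 3, 1, "2018-06-02"), (9, 3, 1, "2018-06-05"),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INT, customer_id INT,"
            " order_status INT, created_at TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

result = con.execute("""
    SELECT order_id,
           CASE
               WHEN order_id > (SELECT MIN(order_id) FROM orders
                                WHERE customer_id = o.customer_id
                                  AND order_status = 3)
               THEN 'old' ELSE 'new'
           END AS customer_status
    FROM orders o
    ORDER BY order_id
""").fetchall()
print(dict(result))
```

For a customer with no status-3 order, the subquery returns NULL, the `>` comparison is never true, and the CASE falls through to 'new'.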
Answer 1 (score: 0):
You can do this with window functions in Spark:
select t.*,
(case when created_at > min(case when order_status = 3 then created_at end) over (partition by customer_id)
then 'old'
else 'new'
end) as customer_status
from test t;
Note that this assigns 'new' to customers that have no orders with status 3.
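Without a Spark session at hand, the window logic (each order's created_at compared to the customer's earliest status-3 created_at) can be checked with a small plain-Python simulation (my own sketch, not part of the answer):

```python
# Simulate the window-function logic: compare each order's created_at
# to the customer's earliest status-3 created_at.
rows = [
    (0, 1, 1, "2018-06-03"), (1, 1, 1, "2018-06-04"),
    (2, 1, 3, "2018-06-04"), (3, 1, 2, "2018-06-05"),
    (4, 1, 1, "2018-06-06"), (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"), (7, 3, 1, "2018-06-02"),
    (8, 3, 1, "2018-06-02"), (9, 3, 1, "2018-06-05"),
]

# min(created_at) over status-3 orders, per customer (absent if none).
first_paid = {}
for _, customer, status, created in rows:
    if status == 3:
        first_paid[customer] = min(first_paid.get(customer, created), created)

labels = {
    order_id: "old" if customer in first_paid and created > first_paid[customer]
    else "new"  # customers with no status-3 order also land here
    for order_id, customer, _, created in rows
}
print(labels)
```

Note that because this compares dates rather than order ids, an order created on the same date as the first status-3 order also counts as 'new', which differs from the order_id-based query in answer 0.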
You can also write this using a join and group by:
select t.*,
coalesce(t3.customer_status, 'old') as customer_status
from test t left join
(select t.customer_id, min(created_at) as min_created_at,
'new' as customer_status
from test t
where t.order_status = 3
group by t.customer_id
) t3
on t.customer_id = t3.customer_id and
t.created_at <= t3.min_created_at;
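This join/group-by variant can likewise be verified on the sample data in in-memory SQLite (a sketch, with the table named `test` and the status column `order_status` as in the question):

```python
import sqlite3

# Run the join/group-by variant of the query on the sample data.
rows = [
    (0, 1, 1, "2018-06-03"), (1, 1, 1, "2018-06-04"),
    (2, 1, 3, "2018-06-04"), (3, 1, 2, "2018-06-05"),
    (4, 1, 1, "2018-06-06"), (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"), (7, 3, 1, "2018-06-02"),
    (8, 3, 1, "2018-06-02"), (9, 3, 1, "2018-06-05"),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (order_id INT, customer_id INT,"
            " order_status INT, created_at TEXT)")
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows)

result = con.execute("""
    SELECT t.order_id,
           COALESCE(t3.customer_status, 'old') AS customer_status
    FROM test t
    LEFT JOIN (SELECT customer_id, MIN(created_at) AS min_created_at,
                      'new' AS customer_status
               FROM test
               WHERE order_status = 3
               GROUP BY customer_id) t3
      ON t.customer_id = t3.customer_id
     AND t.created_at <= t3.min_created_at
    ORDER BY t.order_id
""").fetchall()
print(dict(result))
```

Because of the `COALESCE(..., 'old')`, this variant differs from the window version on one edge case: a customer with no status-3 order never matches the subquery, so all of their orders come out as 'old' rather than 'new'.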