SQL: selecting data before the first occurrence of a value

Date: 2018-11-25 08:13:15

Tags: sql pyspark

For example, I have order data from customers like this:

test = spark.createDataFrame([
    (0, 1, 1, "2018-06-03"),
    (1, 1, 1, "2018-06-04"),
    (2, 1, 3, "2018-06-04"),
    (3, 1, 2, "2018-06-05"),
    (4, 1, 1, "2018-06-06"),
    (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"),
    (7, 3, 1, "2018-06-02"),
    (8, 3, 1, "2018-06-02"),
    (9, 3, 1, "2018-06-05")
])\
  .toDF("order_id", "customer_id", "order_status", "created_at")
test.show()

(screenshot: output of `test.show()`)

Each order has its own status: 1 means newly created but not yet completed, 3 means paid and completed.

Now, I want to analyze whether orders come from

  • new customers (who have never completed a purchase before)
  • old customers (who have already completed a purchase before)

So I want to add a column to the data above, so that it becomes this:

(screenshot: the same data with an added `customer_status` column)

The logic is, for each customer: every order created before the customer's first order with status 3 (including that order itself) counts as an order from a new customer, and every order after that counts as an order from an old customer.

Or in other words: for each customer's orders, sorted by date ascending, select the data before the first occurrence of the value 3.

How can I do this in SQL?

I searched around but didn't find a good solution. If I were using Python, I guess I would write some loops to get the values.
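For reference, the loop-based approach the question hints at can be sketched in plain Python on the sample data (an illustrative baseline only, not a Spark solution; the ISO-formatted dates compare correctly as strings):

```python
# Sample orders from the question: (order_id, customer_id, order_status, created_at)
orders = [
    (0, 1, 1, "2018-06-03"), (1, 1, 1, "2018-06-04"), (2, 1, 3, "2018-06-04"),
    (3, 1, 2, "2018-06-05"), (4, 1, 1, "2018-06-06"), (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"), (7, 3, 1, "2018-06-02"), (8, 3, 1, "2018-06-02"),
    (9, 3, 1, "2018-06-05"),
]

# Date of each customer's first completed (status 3) order, if any.
first_paid = {}
for _, cust, status, created in orders:
    if status == 3 and (cust not in first_paid or created < first_paid[cust]):
        first_paid[cust] = created

# An order counts as "new" up to and including that date, "old" after it.
# Customers with no completed order at all stay "new" throughout.
labelled = [
    (oid, cust, status, created,
     "old" if cust in first_paid and created > first_paid[cust] else "new")
    for oid, cust, status, created in orders
]
for row in labelled:
    print(row)
```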

2 answers:

Answer 0 (score: 0)

Tested against SQLite:

SELECT order_id, customer_id, order_status, created_at,
       CASE
           WHEN order_id > (SELECT MIN(order_id) FROM orders
                            WHERE customer_id = o.customer_id AND order_status = 3)
               THEN 'old'
           ELSE 'new'
       END AS customer_status
FROM orders o
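A quick sanity check of this query with Python's built-in `sqlite3` module (loading the question's sample rows into a table named `orders`, matching the query above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INT, customer_id INT, order_status INT, created_at TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (0, 1, 1, "2018-06-03"), (1, 1, 1, "2018-06-04"), (2, 1, 3, "2018-06-04"),
    (3, 1, 2, "2018-06-05"), (4, 1, 1, "2018-06-06"), (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"), (7, 3, 1, "2018-06-02"), (8, 3, 1, "2018-06-02"),
    (9, 3, 1, "2018-06-05"),
])

result = conn.execute("""
    SELECT order_id, customer_id, order_status, created_at,
           CASE
               WHEN order_id > (SELECT MIN(order_id) FROM orders
                                WHERE customer_id = o.customer_id AND order_status = 3)
                   THEN 'old'
               ELSE 'new'
           END AS customer_status
    FROM orders o
""").fetchall()
for row in result:
    print(row)
```

Note that this version compares by `order_id` rather than `created_at`, so an order placed the same day as, but numbered after, the customer's first paid order (order 6 in the sample) comes out as 'old'. Customers with no status-3 order (customer 3) come out as 'new', because `order_id > NULL` is never true.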

Answer 1 (score: 0)

You can do this with window functions in Spark:

select t.*,
       (case when created_at > min(case when order_status = 3 then created_at end)
                                  over (partition by customer_id)
             then 'old'
             else 'new'
        end) as customer_status
from test t;

Note that this assigns 'new' to customers who have no order with status 3.

You could also write this using `join` and `group by`:

select t.*,
       coalesce(t3.customer_status, 'old') as customer_status
from test t left join
     (select customer_id, min(created_at) as min_created_at,
             'new' as customer_status
      from test
      where order_status = 3
      group by customer_id
     ) t3
     on t.customer_id = t3.customer_id and
        t.created_at <= t3.min_created_at;

Note that, unlike the window version, this marks every order as 'old' for a customer with no status-3 order: the left join finds no match, so the coalesce falls through to 'old'.
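As a check, the join/group-by query can be run against SQLite with the sample data (using `test` as the table name and `order_status` as the column name, matching the question's schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (order_id INT, customer_id INT, order_status INT, created_at TEXT)"
)
conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", [
    (0, 1, 1, "2018-06-03"), (1, 1, 1, "2018-06-04"), (2, 1, 3, "2018-06-04"),
    (3, 1, 2, "2018-06-05"), (4, 1, 1, "2018-06-06"), (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"), (7, 3, 1, "2018-06-02"), (8, 3, 1, "2018-06-02"),
    (9, 3, 1, "2018-06-05"),
])

result = conn.execute("""
    select t.*,
           coalesce(t3.customer_status, 'old') as customer_status
    from test t left join
         (select customer_id, min(created_at) as min_created_at,
                 'new' as customer_status
          from test
          where order_status = 3
          group by customer_id
         ) t3
         on t.customer_id = t3.customer_id and
            t.created_at <= t3.min_created_at
""").fetchall()
for row in result:
    print(row)
```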