My data is:
User_id product_id action
1 apple incart
1 apple purchased
1 banana incart
2 banana incart
2 banana purchased
3 carrot incart
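I need the output to be the user_id and product_id pairs whose only action is incart, that is, items that were added to the cart but never purchased.
Answer 0 (score: 0)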
import org.apache.spark.sql.functions.col

// Split the rows by action, then keep the incart rows that have no matching purchased row.
val df1 = df.filter(col("action") === "purchased")
val df2 = df.filter(col("action") === "incart")
df2.join(df1, Seq("User_id", "product_id"), "leftanti").drop("action").show()
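To make the semantics of the left anti join concrete, here is a minimal sketch of the same logic on plain Scala collections, with no Spark dependency. The `Event` case class and the `AntiJoinSketch` object are hypothetical names introduced only for illustration; the data is the sample from the question.

```scala
// Hypothetical row type mirroring the question's columns.
final case class Event(userId: Int, productId: String, action: String)

object AntiJoinSketch {
  // Sample rows copied from the question.
  val events: Seq[Event] = Seq(
    Event(1, "apple", "incart"),
    Event(1, "apple", "purchased"),
    Event(1, "banana", "incart"),
    Event(2, "banana", "incart"),
    Event(2, "banana", "purchased"),
    Event(3, "carrot", "incart")
  )

  // Right side of the join: (user, product) keys with at least one "purchased" row.
  val purchasedKeys: Set[(Int, String)] =
    events.filter(_.action == "purchased").map(e => (e.userId, e.productId)).toSet

  // Left anti join: keep the "incart" keys that have no match on the right side.
  val inCartOnly: Seq[(Int, String)] =
    events
      .filter(_.action == "incart")
      .map(e => (e.userId, e.productId))
      .filterNot(purchasedKeys.contains)

  def main(args: Array[String]): Unit =
    inCartOnly.foreach { case (user, product) => println(s"$user $product") }
}
```

For the sample data this leaves the pairs (1, banana) and (3, carrot). Like the anti join, the sketch does not deduplicate: a pair with several incart rows and no purchase would appear once per row.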
Answer 1 (score: 0)
Suppose you have a DataFrame like this:
+-------+----------+----------+
|User_id|product_id|    action|
+-------+----------+----------+
|      1|     apple|    incart|
|      1|     apple| purchased|
|      1|    banana|    incart|
|      2|    banana|    incart|
|      2|    banana| purchased|
|      3|    carrot|    incart|
+-------+----------+----------+
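One approach is to apply groupBy to build a new field containing all the actions for each (User_id, product_id) pair, and then filter on the desired condition.
val output = df.groupBy("User_id","product_id").agg(collect_list("action").as("set"))
Then filter as needed. In this case:
output.where(array_contains($"set", "incart").and(!array_contains($"set", "purchased"))).select("User_id","product_id").show()
This produces the expected output: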
+-------+----------+
|User_id|product_id|
+-------+----------+
|      3|    carrot|
|      1|    banana|
+-------+----------+
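The group-then-filter idea can also be sketched on plain Scala collections, which may help when checking the logic without a Spark session. The object name `GroupFilterSketch` is hypothetical. The sketch collects the actions into a `Set`, which is closer to Spark's `collect_set` than to `collect_list`; for a membership test the two behave the same.

```scala
object GroupFilterSketch {
  // (User_id, product_id, action) rows copied from the question.
  val rows: Seq[(Int, String, String)] = Seq(
    (1, "apple", "incart"),
    (1, "apple", "purchased"),
    (1, "banana", "incart"),
    (2, "banana", "incart"),
    (2, "banana", "purchased"),
    (3, "carrot", "incart")
  )

  // Group by (user, product) and collect the distinct actions of each group.
  val actionsByKey: Map[(Int, String), Set[String]] =
    rows
      .groupBy { case (user, product, _) => (user, product) }
      .map { case (key, group) => key -> group.map(_._3).toSet }

  // Keep the groups that contain "incart" but not "purchased".
  // Sorting is only for a stable display order; Map iteration order is unspecified.
  val inCartOnly: Seq[(Int, String)] =
    actionsByKey.toSeq.collect {
      case (key, actions) if actions.contains("incart") && !actions.contains("purchased") => key
    }.sorted

  def main(args: Array[String]): Unit =
    inCartOnly.foreach { case (user, product) => println(s"$user $product") }
}
```

Because the grouping yields one entry per pair, this version returns each qualifying pair exactly once, regardless of how many incart rows it has.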
Answer 2 (score: 0)
You can use NOT EXISTS in Hive:
SELECT t.userid, t.product_id
FROM table t
WHERE t.action = 'incart' AND
      NOT EXISTS (SELECT 1
                  FROM table t1
                  WHERE t1.userid = t.userid AND
                        t1.product_id = t.product_id AND
                        t1.action = 'purchased'
                 );
Answer 3 (score: 0)
Use a simple aggregation with CASE:
SELECT s.userid, s.product_id
FROM
(
  SELECT t.userid, t.product_id,
         max(case when t.action = 'purchased' then 1 else 0 end) has_purchased,
         max(case when t.action = 'incart' then 1 else 0 end) has_incart
  FROM table t
  GROUP BY t.userid, t.product_id
) s
WHERE s.has_purchased = 0 AND s.has_incart = 1;
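As a rough illustration of what the max(case ...) flags compute, the following sketch reproduces the aggregation on plain Scala collections. The object name `FlagAggregationSketch` and the tuple layout are assumptions made for the example, not part of the answer above.

```scala
object FlagAggregationSketch {
  // (userid, product_id, action) rows copied from the question.
  val rows: Seq[(Int, String, String)] = Seq(
    (1, "apple", "incart"),
    (1, "apple", "purchased"),
    (1, "banana", "incart"),
    (2, "banana", "incart"),
    (2, "banana", "purchased"),
    (3, "carrot", "incart")
  )

  // For each (user, product) group, compute the 0/1 flags
  // (has_purchased, has_incart) as the max of a per-row indicator.
  val flags: Map[(Int, String), (Int, Int)] =
    rows
      .groupBy { case (user, product, _) => (user, product) }
      .map { case (key, group) =>
        val hasPurchased = group.map { case (_, _, a) => if (a == "purchased") 1 else 0 }.max
        val hasIncart = group.map { case (_, _, a) => if (a == "incart") 1 else 0 }.max
        key -> (hasPurchased, hasIncart)
      }

  // Equivalent of: WHERE has_purchased = 0 AND has_incart = 1
  val result: Seq[(Int, String)] =
    flags.toSeq.collect { case (key, (0, 1)) => key }.sorted

  def main(args: Array[String]): Unit =
    result.foreach { case (user, product) => println(s"$user $product") }
}
```

A single pass over the grouped data is enough here, which is the main appeal of the aggregation approach compared with a self-join or a correlated subquery.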