How to write a Spark and Hive query for the following case

Time: 2018-09-05 07:36:19

Tags: sql apache-spark hadoop hive

My data is:

User id     product_id    action

1           apple         incart
1           apple         purchased
1           banana        incart
2           banana        incart
2           banana        purchased
3           carrot        incart

I need the output to be the user_id and product_id pairs whose action is only incart, with no corresponding purchased row.

4 answers:

Answer 0: (score: 0)

import org.apache.spark.sql.functions.col
// Split the rows by action, then keep incart rows that have no matching purchased row.
val df1 = df.filter(col("action") === "purchased")
val df2 = df.filter(col("action") === "incart")
df2.join(df1, Seq("User_id", "product_id"), "leftanti").drop("action").show()
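
The snippet above assumes a DataFrame `df` already exists. As a minimal end-to-end sketch, the following builds the question's sample rows and applies the same left anti join. The object name `IncartOnlyJoin`, the local `SparkSession` settings, the `distinct()` call, and the `trim` on `action` (a guard in case the raw values carry stray whitespace) are assumptions added for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, trim}

object IncartOnlyJoin {
  // Keep (User_id, product_id) pairs that were put in the cart but never purchased.
  def incartOnly(df: DataFrame): DataFrame = {
    // Assumption: normalise whitespace so "purchased " still matches "purchased".
    val cleaned   = df.withColumn("action", trim(col("action")))
    val purchased = cleaned.filter(col("action") === "purchased")
    val incart    = cleaned.filter(col("action") === "incart")
    incart
      .join(purchased, Seq("User_id", "product_id"), "leftanti") // incart rows without a purchase
      .select("User_id", "product_id")
      .distinct()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("incart-only-join")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val df = Seq(
      (1, "apple", "incart"),
      (1, "apple", "purchased"),
      (1, "banana", "incart"),
      (2, "banana", "incart"),
      (2, "banana", "purchased"),
      (3, "carrot", "incart")
    ).toDF("User_id", "product_id", "action")

    incartOnly(df).show()
    spark.stop()
  }
}
```

Joining on `Seq("User_id", "product_id")` rather than on column expressions avoids ambiguous column references when both sides derive from the same DataFrame.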

Answer 1: (score: 0)

Suppose you have a DataFrame like this:

+-------+----------+----------+
|User_id|product_id|    action|
+-------+----------+----------+
|      1|     apple|    incart|
|      1|     apple| purchased|
|      1|    banana|    incart|
|      2|    banana|    incart|
|      2|    banana| purchased|
|      3|    carrot|    incart|
+-------+----------+----------+

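One approach is to apply groupBy to build a new field that holds every action for each (User_id, product_id) pair, and then filter on the required condition.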

import org.apache.spark.sql.functions.{array_contains, collect_list}
val output = df.groupBy("User_id", "product_id").agg(collect_list("action").as("set"))

Then filter as needed. In this case:

output.where(array_contains($"set", "incart") && !array_contains($"set", "purchased"))
  .select("User_id", "product_id").show()

This produces the expected output:

+-------+----------+
|User_id|product_id|
+-------+----------+
|      3|    carrot|
|      1|    banana|
+-------+----------+

Answer 2: (score: 0)

You can use NOT EXISTS in Hive (here `table` stands in for your actual table name):

SELECT t.userid, t.product_id
FROM table t
WHERE t.action = 'incart' AND
      NOT EXISTS (SELECT 1
                  FROM table t1
                  WHERE t1.userid = t.userid AND
                        t1.product_id = t.product_id AND
                        t1.action = 'purchased'
                 );
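
Since the question also asks about Spark, here is a hedged sketch of running the same NOT EXISTS query through Spark SQL, which accepts this kind of correlated subquery. The object name `IncartOnlySql` and the view name `user_actions` (used in place of the `table` placeholder) are hypothetical, and the column names follow this answer (`userid`, `product_id`, `action`).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object IncartOnlySql {
  // Registers df under a hypothetical view name and runs the NOT EXISTS query on it.
  def incartOnly(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("user_actions")
    spark.sql(
      """SELECT t.userid, t.product_id
        |FROM user_actions t
        |WHERE t.action = 'incart'
        |  AND NOT EXISTS (SELECT 1
        |                  FROM user_actions t1
        |                  WHERE t1.userid = t.userid
        |                    AND t1.product_id = t.product_id
        |                    AND t1.action = 'purchased')""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("incart-only-sql")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val df = Seq(
      (1, "apple", "incart"),
      (1, "apple", "purchased"),
      (1, "banana", "incart"),
      (2, "banana", "incart"),
      (2, "banana", "purchased"),
      (3, "carrot", "incart")
    ).toDF("userid", "product_id", "action")

    incartOnly(spark, df).show()
    spark.stop()
  }
}
```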

Answer 3: (score: 0)

Use a simple aggregation with CASE:

SELECT s.userid, s.product_id
FROM
(
SELECT t.userid, t.product_id,
       max(case when t.action = 'purchased' then 1 else 0 end) has_purchased,
       max(case when t.action = 'incart'    then 1 else 0 end) has_incart
FROM table t
GROUP BY t.userid, t.product_id
) s
WHERE s.has_purchased = 0 AND s.has_incart = 1;
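
The same conditional aggregation can be expressed with the Spark DataFrame API. This is a minimal sketch, not part of the original answer: the object name `IncartOnlyAgg` and the local `SparkSession` settings are assumptions, and `max(when(...).otherwise(0))` plays the role of `max(case when ... then 1 else 0 end)`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, when}

object IncartOnlyAgg {
  // One row per (userid, product_id) with 0/1 flags, then keep incart-without-purchase pairs.
  def incartOnly(df: DataFrame): DataFrame =
    df.groupBy("userid", "product_id")
      .agg(
        max(when(col("action") === "purchased", 1).otherwise(0)).as("has_purchased"),
        max(when(col("action") === "incart", 1).otherwise(0)).as("has_incart"))
      .where(col("has_purchased") === 0 && col("has_incart") === 1)
      .select("userid", "product_id")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("incart-only-agg")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val df = Seq(
      (1, "apple", "incart"),
      (1, "apple", "purchased"),
      (1, "banana", "incart"),
      (2, "banana", "incart"),
      (2, "banana", "purchased"),
      (3, "carrot", "incart")
    ).toDF("userid", "product_id", "action")

    incartOnly(df).show()
    spark.stop()
  }
}
```

Because the data is grouped once and never joined back to itself, this form reads the input a single time, which may matter on large tables.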