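Here is an example that illustrates my question.
In this example, we collect, for each user, the list of other products that user has purchased and append it to the transactions table as a new column. (Also note that we are filtering on an arbitrary column, 'good_bad'.)
I would like to know whether Spark SQL supports excluding the CURRENT ROW in a PARTITION BY window function.
For example, transaction 1 would have other_purchases = [prod2, prod3] rather than other_purchases = [prod1, prod2, prod3].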
# Assumes an existing SparkSession named `spark` (e.g. in a notebook or the pyspark shell).
df = spark.createDataFrame([
    (1, "user1", "prod1", "good"),
    (2, "user1", "prod2", "good"),
    (3, "user1", "prod3", "good"),
    (4, "user2", "prod3", "bad"),
    (5, "user2", "prod4", "good"),
    (5, "user2", "prod5", "good")],
    ("trans_id", "user_id", "prod_id", "good_bad")
)
df.show()

# Collect every 'good' product per user; the current row's product is still included here.
df = df.selectExpr(
    "trans_id",
    "user_id",
    "COLLECT_LIST(CASE WHEN good_bad == 'good' THEN prod_id END) OVER(PARTITION BY user_id) AS other_purchases"
)
df.show()
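To pin down the result being asked for, here is a minimal plain-Python sketch of the intended semantics, independent of Spark. The helper name `other_purchases` and the use of the row position to identify "the current row" are assumptions made for illustration only; the rows are the same sample data as above.

```python
from collections import defaultdict

# Same sample data as in the question.
rows = [
    (1, "user1", "prod1", "good"),
    (2, "user1", "prod2", "good"),
    (3, "user1", "prod3", "good"),
    (4, "user2", "prod3", "bad"),
    (5, "user2", "prod4", "good"),
    (5, "user2", "prod5", "good"),
]

def other_purchases(rows):
    """Hypothetical helper: for each row, list the user's 'good' products from all OTHER rows."""
    # Group the 'good' products per user, remembering which row each came from.
    by_user = defaultdict(list)
    for idx, (_, user_id, prod_id, good_bad) in enumerate(rows):
        if good_bad == "good":
            by_user[user_id].append((idx, prod_id))

    result = []
    for idx, (trans_id, user_id, prod_id, _) in enumerate(rows):
        # Skip only the entry that came from the current row.
        others = [p for i, p in by_user[user_id] if i != idx]
        result.append((trans_id, user_id, prod_id, others))
    return result

for row in other_purchases(rows):
    print(row)
# (1, 'user1', 'prod1', ['prod2', 'prod3'])
# (2, 'user1', 'prod2', ['prod1', 'prod3'])
# (3, 'user1', 'prod3', ['prod1', 'prod2'])
# (4, 'user2', 'prod3', ['prod4', 'prod5'])
# (5, 'user2', 'prod4', ['prod5'])
# (5, 'user2', 'prod5', ['prod4'])
```

Note that this model keeps the input order within each list, whereas COLLECT_LIST over an unordered window makes no such ordering promise.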
Answer 0 (score: 0)
Replace instances of the current row's prod_id with '' (the empty string).
For clarity, this is shown in two steps below.
Code:
%pyspark
df = spark.createDataFrame([
    (1, "user1", "prod1", "good"),
    (2, "user1", "prod2", "good"),
    (3, "user1", "prod3", "good"),
    (4, "user2", "prod3", "bad"),
    (5, "user2", "prod4", "good"),
    (5, "user2", "prod5", "good")],
    ("trans_id", "user_id", "prod_id", "good_bad")
)
df.show()

# Step 1: collect every 'good' product per user (the current row's product is still included).
df = df.selectExpr(
    "trans_id",
    "user_id",
    "prod_id",
    "COLLECT_LIST(CASE WHEN good_bad == 'good' THEN prod_id END) OVER(PARTITION BY user_id) AS other_purchases"
)

# Step 2: join the list into a string, blank out the current prod_id, then split it back into an array.
df = df.selectExpr(
    "trans_id",
    "user_id",
    "prod_id",
    "other_purchases",
    "SPLIT(TRIM(REGEXP_REPLACE(CONCAT_WS(' ', other_purchases), prod_id, '')), '[ ]+') AS other_purchases_filtered"
)
df.show()
Output:
+--------+-------+-------+--------+
|trans_id|user_id|prod_id|good_bad|
+--------+-------+-------+--------+
|       1|  user1|  prod1|    good|
|       2|  user1|  prod2|    good|
|       3|  user1|  prod3|    good|
|       4|  user2|  prod3|     bad|
|       5|  user2|  prod4|    good|
|       5|  user2|  prod5|    good|
+--------+-------+-------+--------+
+--------+-------+-------+--------------------+------------------------+
|trans_id|user_id|prod_id|     other_purchases|other_purchases_filtered|
+--------+-------+-------+--------------------+------------------------+
|       1|  user1|  prod1|[prod1, prod2, pr...|          [prod2, prod3]|
|       2|  user1|  prod2|[prod1, prod2, pr...|          [prod1, prod3]|
|       3|  user1|  prod3|[prod1, prod2, pr...|          [prod1, prod2]|
|       4|  user2|  prod3|      [prod4, prod5]|          [prod4, prod5]|
|       5|  user2|  prod4|      [prod4, prod5]|                 [prod5]|
|       5|  user2|  prod5|      [prod4, prod5]|                 [prod4]|
+--------+-------+-------+--------------------+------------------------+
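
One caveat with the string-based filtering: REGEXP_REPLACE treats prod_id as a regular-expression pattern and matches it anywhere in the joined string, so an ID that is a prefix of another ID would corrupt the neighbour. The plain-Python sketch below imitates the same join / replace / trim / split pipeline to show the effect. The function names and the IDs "prod1" / "prod10" are hypothetical and do not appear in the sample data; Python's `re` stands in for Spark's regex engine, which behaves the same way for plain literals like these.

```python
import re

def filtered_via_regex(other_purchases, prod_id):
    """Hypothetical imitation of SPLIT(TRIM(REGEXP_REPLACE(CONCAT_WS(' ', xs), prod_id, '')), '[ ]+')."""
    joined = " ".join(other_purchases)            # CONCAT_WS(' ', other_purchases)
    replaced = re.sub(prod_id, "", joined)        # REGEXP_REPLACE(..., prod_id, '')
    return re.split("[ ]+", replaced.strip(" "))  # SPLIT(TRIM(...), '[ ]+')

def filtered_via_list(other_purchases, prod_id):
    """Compare whole elements instead of substrings."""
    return [p for p in other_purchases if p != prod_id]

# With the sample data, both approaches agree.
print(filtered_via_regex(["prod1", "prod2", "prod3"], "prod1"))  # ['prod2', 'prod3']
print(filtered_via_list(["prod1", "prod2", "prod3"], "prod1"))   # ['prod2', 'prod3']

# With a hypothetical ID that contains another ID, the regex version damages it.
print(filtered_via_regex(["prod1", "prod10"], "prod1"))  # ['0']
print(filtered_via_list(["prod1", "prod10"], "prod1"))   # ['prod10']
```

If the Spark version in use provides array functions such as array_remove (added in Spark 2.4), filtering the array directly would avoid the string round trip. Like the regex approach, it would still drop every element equal to the current prod_id, not only the one contributed by the current row.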