Spark SQL: Excluding CURRENT ROW in a PARTITION BY window function

Time: 2017-04-03 09:13:03

Tags: apache-spark pyspark apache-spark-sql spark-dataframe

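Here is an example that illustrates my question.

In this example, we collect a list of the other products each user has purchased and append it to the transaction table as a new column. (Note also that we filter on an arbitrary column, 'good_bad'.)

I would like to know whether Spark SQL supports excluding the CURRENT ROW in a PARTITION BY window function.

For example, transaction 1 would have other_purchases = [prod2, prod3] rather than other_purchases = [prod1, prod2, prod3].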
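
To pin down the intended result, here is a minimal plain-Python sketch of the semantics being asked for. It is only a reference model, not Spark code; the helper name `other_purchases` is hypothetical, and it assumes "excluding the current row" means dropping the row by position, not by product value.

```python
from collections import defaultdict

# Same sample rows as the DataFrame in this question.
rows = [
    (1, "user1", "prod1", "good"),
    (2, "user1", "prod2", "good"),
    (3, "user1", "prod3", "good"),
    (4, "user2", "prod3", "bad"),
    (5, "user2", "prod4", "good"),
    (5, "user2", "prod5", "good"),
]


def other_purchases(rows):
    """Hypothetical helper: per row, list the user's 'good' products from all OTHER rows."""
    # Group the 'good' purchases by user, remembering each row's position.
    by_user = defaultdict(list)
    for idx, (_, user_id, prod_id, good_bad) in enumerate(rows):
        if good_bad == "good":
            by_user[user_id].append((idx, prod_id))

    result = []
    for idx, (trans_id, user_id, _, _) in enumerate(rows):
        # Skip the entry that came from the current row itself.
        others = [p for i, p in by_user[user_id] if i != idx]
        result.append((trans_id, user_id, others))
    return result


for row in other_purchases(rows):
    print(row)
```

Because the exclusion is by row position, a product that the same user bought again in a different transaction would still appear in the list.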

df = spark.createDataFrame([
    (1, "user1", "prod1", "good"), 
    (2, "user1", "prod2", "good"), 
    (3, "user1", "prod3", "good"), 
    (4, "user2", "prod3", "bad"), 
    (5, "user2", "prod4", "good"), 
    (5, "user2", "prod5", "good")], 
    ("trans_id", "user_id", "prod_id", "good_bad")
)
df.show()

df = df.selectExpr(
    "trans_id", 
    "user_id", 
    "COLLECT_LIST(CASE WHEN good_bad == 'good' THEN prod_id END) OVER(PARTITION BY user_id) AS other_purchases"
)
df.show()

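1 Answer:

Answer 0: (score: 0)

OK, so I found a solution, but it is a bit ridiculous. It involves concatenating the array into a string and then replacing instances of the current row's prod_id with ''.

For clarity, it is shown in two steps below.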

Code:

%pyspark
df = spark.createDataFrame([ 
    (1, "user1", "prod1", "good"), 
    (2, "user1", "prod2", "good"), 
    (3, "user1", "prod3", "good"), 
    (4, "user2", "prod3", "bad"), 
    (5, "user2", "prod4", "good"), 
    (5, "user2", "prod5", "good")], 
    ("trans_id", "user_id", "prod_id", "good_bad") 
) 
df.show() 

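# Step 1: collect each user's 'good' purchases over the whole partition
# (the current row's own prod_id is still included at this point).
df = df.selectExpr(
    "trans_id",
    "user_id",
    "prod_id",
    "COLLECT_LIST(CASE WHEN good_bad == 'good' THEN prod_id END) OVER(PARTITION BY user_id) AS other_purchases"
)

# Step 2: join the array into a space-separated string, blank out the
# current row's prod_id, then split the string back into an array.
df = df.selectExpr(
    "trans_id",
    "user_id",
    "prod_id",
    "other_purchases",
    "SPLIT(TRIM(REGEXP_REPLACE(CONCAT_WS(' ', other_purchases), prod_id, '')), '[ ]+') AS other_purchases_filtered"
)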
df.show() 

Output:

+--------+-------+-------+--------+
|trans_id|user_id|prod_id|good_bad|
+--------+-------+-------+--------+
|       1|  user1|  prod1|    good|
|       2|  user1|  prod2|    good|
|       3|  user1|  prod3|    good|
|       4|  user2|  prod3|     bad|
|       5|  user2|  prod4|    good|
|       5|  user2|  prod5|    good|
+--------+-------+-------+--------+
+--------+-------+-------+--------------------+------------------------+
|trans_id|user_id|prod_id|     other_purchases|other_purchases_filtered|
+--------+-------+-------+--------------------+------------------------+
|       1|  user1|  prod1|[prod1, prod2, pr...|          [prod2, prod3]|
|       2|  user1|  prod2|[prod1, prod2, pr...|          [prod1, prod3]|
|       3|  user1|  prod3|[prod1, prod2, pr...|          [prod1, prod2]|
|       4|  user2|  prod3|      [prod4, prod5]|          [prod4, prod5]|
|       5|  user2|  prod4|      [prod4, prod5]|                 [prod5]|
|       5|  user2|  prod5|      [prod4, prod5]|                 [prod4]|
+--------+-------+-------+--------------------+------------------------+
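
One caveat with the string-replacement trick: it works on substrings, so it is only safe when no product ID is contained in another. The plain-Python sketch below imitates the SPLIT/TRIM/REGEXP_REPLACE/CONCAT_WS chain to show the failure mode. Python's `re` module stands in for Spark's Java regex engine here, and both function names are hypothetical.

```python
import re


def drop_current_by_regex(purchases, prod_id):
    """Imitates the string-based approach: join, regex-replace, trim, split."""
    joined = " ".join(purchases)               # CONCAT_WS(' ', other_purchases)
    replaced = re.sub(prod_id, "", joined)     # REGEXP_REPLACE(..., prod_id, '')
    return re.split(r"[ ]+", replaced.strip())  # SPLIT(TRIM(...), '[ ]+')


def drop_current_by_filter(purchases, prod_id):
    """Element-wise comparison, shown for contrast."""
    return [p for p in purchases if p != prod_id]


# Works for the sample data, where no ID is a prefix of another.
print(drop_current_by_regex(["prod1", "prod2", "prod3"], "prod1"))

# Breaks once one ID is contained in another: "prod10" is mangled to "0".
print(drop_current_by_regex(["prod1", "prod10"], "prod1"))

# Comparing whole elements leaves "prod10" intact.
print(drop_current_by_filter(["prod1", "prod10"], "prod1"))
```

Both variants remove by value, so a product the user bought in several transactions disappears entirely from that row's list. Product IDs containing spaces or regex metacharacters would also confuse the string-based version.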