Question

I have data in a PySpark DataFrame (it is a very large table with 900M rows).

The DataFrame contains a column with the following values:

+---------------+
|prev_display_id|
+---------------+
|           null|
|           null|
|           1062|
|           null|
|           null|
|           null|
|           null|
|       18882624|
|       11381128|
|           null|
|           null|
|           null|
|           null|
|           2779|
|           null|
|           null|
|           null|
|           null|
+---------------+

I am trying to generate a new column based on this column, like this:

+---------------+------+
|prev_display_id|result|
+---------------+------+
|           null|     0|
|           null|     1|
|           1062|     0|
|           null|     1|
|           null|     2|
|           null|     3|
|           null|     4|
|       18882624|     0|
|       11381128|     0|
|           null|     1|
|           null|     2|
|           null|     3|
|           null|     4|
|           2779|     0|
|           null|     1|
|           null|     2|
|           null|     3|
|           null|     4|
+---------------+------+

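The new column behaves like this:

result = 0 if prev_display_id is not null, else (result of the previous row) + 1

result acts as a running counter that resets to zero whenever a non-null value is encountered. A null in the very first row also gets 0.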
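To pin down the intended semantics, here is a minimal plain-Python sketch of the counter. It is only a reference for the logic, not PySpark and not something that would scale to 900M rows; the function name `running_counter` is made up for illustration.

```python
from typing import Iterable, List, Optional


def running_counter(values: Iterable[Optional[int]]) -> List[int]:
    """Return 0 for every non-null value, otherwise the previous result + 1."""
    result: List[int] = []
    counter = -1  # so that a leading null starts at 0
    for value in values:
        if value is not None:
            counter = 0   # reset on a non-null value
        else:
            counter += 1  # keep counting consecutive nulls
        result.append(counter)
    return result


sample = [None, None, 1062, None, None, None, None, 18882624, 11381128,
          None, None, None, None, 2779, None, None, None, None]
print(running_counter(sample))
```

The sample list mirrors the prev_display_id column shown above, so the printed list can be compared against the result column row by row.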

How can I do this efficiently in PySpark?

Update

I tried the solution suggested by @anki below. It works on small datasets, but it produces this warning:

WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

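Unfortunately, on my large dataset it seems to kill the cluster. The error from running on the large dataset with 2 rd5.2xlarge data nodes is shown in the image below:

Is there any way to solve this?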

Answer 1

As I understand it, you can create an id column with monotonically_increasing_id, take a sum over a window of the cases where prev_display_id is not null, and then take the row number partitioned by that column, minus 1:

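from pyspark.sql import Window, functions as F

# Running count of non-null values: each non-null row starts a new group.
w = Window.orderBy(F.monotonically_increasing_id())
w1 = F.sum((F.col("prev_display_id").isNotNull()).cast("integer")).over(w)

# Row number within each group, minus 1, restarts the counter at 0.
(df.withColumn("result",F.row_number()
 .over(Window.partitionBy(w1).orderBy(w1))-1)).show()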
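The idea behind this answer (a cumulative count of non-null values used as a group id, then the row position inside each group) can be traced in plain Python. This is only a sketch of the logic on an in-memory list, not PySpark code; `counter_via_groups` is a hypothetical name.

```python
from itertools import accumulate
from typing import Dict, List, Optional


def counter_via_groups(values: List[Optional[int]]) -> List[int]:
    # Step 1: running count of non-null values. It increases at every
    # non-null row, so that row and the nulls after it share one group id.
    group_ids = list(accumulate(int(v is not None) for v in values))

    # Step 2: zero-based position of each row inside its group
    # (the equivalent of row_number() over the group, minus 1).
    seen: Dict[int, int] = {}
    result: List[int] = []
    for group_id in group_ids:
        seen[group_id] = seen.get(group_id, -1) + 1
        result.append(seen[group_id])
    return result


values = [None, None, 1062, None, None, 18882624, 11381128, None]
print(counter_via_groups(values))
```

Two consecutive non-null values land in two different groups, which is why each of them gets 0.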

Answer 2

You can get this by running the following:

from pyspark.sql import Window, functions as f

window = Window.orderBy(f.monotonically_increasing_id())
# Row number of the most recent non-null row (null before the first one).
last_ne = f.last('ne',ignorenulls=True).over(window)
df.withColumn('row',f.row_number().over(window))\
.withColumn('ne',f.when(f.col('consumer_id').isNotNull(),f.col('row')))\
.withColumn('result',f.when(f.col('ne').isNull(),
    f.col('row')-f.when(last_ne.isNull(),1).otherwise(last_ne)).otherwise(0))\
.drop('row','ne').show()

+-----------+------+
|consumer_id|result|
+-----------+------+
|       null|     0|
|       null|     1|
|       null|     2|
|         11|     0|
|         11|     0|
|       null|     1|
|       null|     2|
|         12|     0|
|         12|     0|
+-----------+------+
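The arithmetic used here (current row number minus the row number of the last non-null row, or minus 1 if there has been none yet) can be checked with a small plain-Python sketch. It is not PySpark code, and `counter_via_last_non_null` is a hypothetical name used only for illustration.

```python
from typing import List, Optional


def counter_via_last_non_null(values: List[Optional[int]]) -> List[int]:
    result: List[int] = []
    last_non_null_row: Optional[int] = None  # plays the role of last('ne', ignorenulls=True)
    for row, value in enumerate(values, start=1):  # row numbers start at 1
        if value is not None:
            last_non_null_row = row
            result.append(0)
        else:
            # Before the first non-null row, subtract 1 so counting starts at 0.
            base = last_non_null_row if last_non_null_row is not None else 1
            result.append(row - base)
    return result


consumer_ids = [None, None, None, 11, 11, None, None, 12, 12]
print(counter_via_last_non_null(consumer_ids))
```

The input list mirrors the consumer_id column in the table above, so the printed list can be compared against its result column.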

PySpark: add a column based on another column and a running counter