PySpark window function: fill data with the previous non-zero value

Time: 2019-12-16 19:31:40

Tags: python pyspark window

I have this DataFrame:

+----------+-------+----+-----------+
|      Date|Count  |  GR|Count_NEW  |
+----------+-------+----+-----------+
|2012-01-02|     25| 100|         25|
|2012-01-02|    250| 110|        250|
|2012-01-03|     26| 100|         26|
|2012-01-03|    251| 110|        251|
|2012-01-04|     24| 100|         24|
|2012-01-04|    242| 110|        242|
|2012-01-05|     26| 100|         26|
|2012-01-05|    254| 110|        254|
|2012-01-06|      0| 100|          0|
|2012-01-06|    254| 110|        254|
|2012-01-07|     25| 100|         25|
|2012-01-07|    256| 110|        256|
|2012-01-08|     28| 100|         28|
|2012-01-08|      0| 110|          0|
|2012-01-09|     22| 100|         22|
|2012-01-09|    289| 110|        289|
|2012-01-10|     29| 100|         29|
|2012-01-10|    276| 110|        276|
|2012-01-11|     21| 100|         21|
|2012-01-11|    259| 110|        259|
+----------+-------+----+-----------+

You can use this to create the DataFrame:

l = [
     ('100', '2012-01-02', 25),
     ('110', '2012-01-02', 250),
     ('100', '2012-01-03', 26),
     ('110', '2012-01-03', 251),
     ('100', '2012-01-04', 24),
     ('110', '2012-01-04', 242),
     ('100', '2012-01-05', 26),
     ('110', '2012-01-05', 254),
     ('100', '2012-01-06', 0),
     ('110', '2012-01-06', 254),
     ('100', '2012-01-07', 25),
     ('110', '2012-01-07', 256),
     ('100', '2012-01-08', 28),
     ('110', '2012-01-08', 0),
     ('100', '2012-01-09', 22),
     ('110', '2012-01-09', 289),
     ('100', '2012-01-10', 29),
     ('110', '2012-01-10', 276),
     ('100', '2012-01-11', 21),
     ('110', '2012-01-11', 259),
     ('100', '2012-01-12', 32),
     ('110', '2012-01-12', 280),
     ('100', '2012-01-13', 39),
     ('110', '2012-01-13', 290)
    ]

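from pyspark.sql import Row

# sparkc (a SparkContext) and sqlContext (a SQLContext) are assumed to exist already
rdd = sparkc.parallelize(l)
member = rdd.map(lambda x: Row(GR=x[0], Date=x[1], Count=int(x[2])))

pdf = sqlContext.createDataFrame(member)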

Count and Count_NEW are the same column (so please ignore Count_NEW).

I want to fill 2012-01-06 for GR = 100 with 26, and 2012-01-08 for GR = 110 with 256.

So it should look like this...

+----------+-------+----+-----------+
|      Date|Count  |  GR|Count_NEW  |
+----------+-------+----+-----------+
|2012-01-02|     25| 100|         25|
|2012-01-02|    250| 110|        250|
|2012-01-03|     26| 100|         26|
|2012-01-03|    251| 110|        251|
|2012-01-04|     24| 100|         24|
|2012-01-04|    242| 110|        242|
|2012-01-05|     26| 100|         26|
|2012-01-05|    254| 110|        254|
|2012-01-06|     26| 100|          0|
|2012-01-06|    254| 110|        254|
|2012-01-07|     25| 100|         25|
|2012-01-07|    256| 110|        256|
|2012-01-08|     28| 100|         28|
|2012-01-08|    256| 110|          0|
|2012-01-09|     22| 100|         22|
|2012-01-09|    289| 110|        289|
|2012-01-10|     29| 100|         29|
|2012-01-10|    276| 110|        276|
|2012-01-11|     21| 100|         21|
|2012-01-11|    259| 110|        259|
+----------+-------+----+-----------+

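In other words, I want to fill each zero with the previous non-zero value. How can I do this with a window function?

I tried this, but it does not work...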

win = Window.partitionBy("GR").orderBy("Date")\
                .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df1 = df1.withColumn("Count", last('Count', True).over(win))

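Any help is greatly appreciated.


After @corgiman's answer (thank you very much for your time and help)...

If the DataFrame looks like this, @corgiman's solution does not work: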

+-----+----------+---+---------+
|Count|      Date| GR|Count_NEW|
+-----+----------+---+---------+
|   25|2012-01-02|100|       25|
|  250|2012-01-02|110|      250|
|   26|2012-01-03|100|       26|
|  251|2012-01-03|110|      251|
|   24|2012-01-04|100|       24|
|  242|2012-01-04|110|      242|
|   26|2012-01-05|100|       26|
|  254|2012-01-05|110|      254|
|    0|2012-01-06|100|        0|
|  254|2012-01-06|110|      254|
|    0|2012-01-07|100|        0|
|  256|2012-01-07|110|      256|
|   28|2012-01-08|100|       28|
|    0|2012-01-08|110|        0|
|   22|2012-01-09|100|       22|
|  289|2012-01-09|110|      289|
|   29|2012-01-10|100|       29|
|  276|2012-01-10|110|      276|
|   21|2012-01-11|100|       21|
|  259|2012-01-11|110|      259|
+-----+----------+---+---------+

Here GR = 100 is 0 on both 2012-01-06 and 2012-01-07, and I want both filled with the previous non-zero value, i.e. the 26 from 2012-01-05.

So the desired result is this...

+-----+----------+---+---------+
|Count|      Date| GR|Count_NEW|
+-----+----------+---+---------+
|  250|2012-01-02|110|      250|
|  251|2012-01-03|110|      251|
|  242|2012-01-04|110|      242|
|  254|2012-01-05|110|      254|
|  254|2012-01-06|110|      254|
|  256|2012-01-07|110|      256|
|    0|2012-01-08|110|      256|
|  289|2012-01-09|110|      289|
|  276|2012-01-10|110|      276|
|  259|2012-01-11|110|      259|
|  280|2012-01-12|110|      280|
|  290|2012-01-13|110|      290|
|   25|2012-01-02|100|       25|
|   26|2012-01-03|100|       26|
|   24|2012-01-04|100|       24|
|   26|2012-01-05|100|       26|
|    0|2012-01-06|100|       26|
**|    0|2012-01-07|100|       26|**
|   28|2012-01-08|100|       28|
|   22|2012-01-09|100|       22|
+-----+----------+---+---------+

But what I get is this...

+-----+----------+---+---------+
|Count|      Date| GR|Count_NEW|
+-----+----------+---+---------+
|  250|2012-01-02|110|      250|
|  251|2012-01-03|110|      251|
|  242|2012-01-04|110|      242|
|  254|2012-01-05|110|      254|
|  254|2012-01-06|110|      254|
|  256|2012-01-07|110|      256|
|    0|2012-01-08|110|      256|
|  289|2012-01-09|110|      289|
|  276|2012-01-10|110|      276|
|  259|2012-01-11|110|      259|
|  280|2012-01-12|110|      280|
|  290|2012-01-13|110|      290|
|   25|2012-01-02|100|       25|
|   26|2012-01-03|100|       26|
|   24|2012-01-04|100|       24|
|   26|2012-01-05|100|       26|
|    0|2012-01-06|100|       26|
*|    0|2012-01-07|100|        0|*
|   28|2012-01-08|100|       28|
|   22|2012-01-09|100|       22|
+-----+----------+---+---------+
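
The behaviour asked for above amounts to a forward fill within each GR group: every zero takes the most recent non-zero Count seen earlier in Date order, however many zeros occur in a row. As a plain-Python illustration of that rule only (no Spark involved; the function name fill_zeros_forward and the sample rows are made up for this sketch):

```python
from collections import defaultdict


def fill_zeros_forward(rows):
    """rows: iterable of (gr, date, count) tuples.

    Returns (gr, date, count, count_new) tuples sorted by (gr, date), where
    count_new is the most recent non-zero count within the same gr
    (None if no non-zero value has been seen yet).
    """
    groups = defaultdict(list)
    for gr, date, count in rows:
        groups[gr].append((date, count))

    result = []
    for gr in sorted(groups):
        last_non_zero = None
        # ISO-formatted dates sort correctly as plain strings
        for date, count in sorted(groups[gr]):
            if count != 0:
                last_non_zero = count
            result.append((gr, date, count, last_non_zero))
    return result


sample = [
    ('100', '2012-01-05', 26),
    ('100', '2012-01-06', 0),
    ('100', '2012-01-07', 0),
    ('100', '2012-01-08', 28),
    ('110', '2012-01-07', 256),
    ('110', '2012-01-08', 0),
]
filled = fill_zeros_forward(sample)
for row in filled:
    print(row)
```

Both zero rows of GR = 100 end up with 26 here, which is the result the window-function solution needs to reproduce.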

3 Answers:

Answer 0 (score: 1)

You can change the 0 values to null and use the ignorenulls argument of the last function.

Sample code:


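from pyspark.sql import Window
from pyspark.sql import functions as F

# Turn zeros into nulls so that last(..., ignorenulls=True) can skip them
pdf = pdf.withColumn('Count', F.when(pdf['Count'] == 0, F.lit(None)).otherwise(pdf['Count']))

win = Window.partitionBy("GR").orderBy("Date")
s = F.last('Count', ignorenulls = True).over(win)

# Both branches evaluate to s, so this is equivalent to withColumn("Count", s)
pdf = pdf.withColumn("Count", F.when(pdf['Count'] == F.lag('Count').over(win), s).otherwise(s))

pdf.show()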

The output will be:

+---+----------+-----+
| GR|      Date|Count|
+---+----------+-----+
|110|2012-01-02|  250|
|110|2012-01-03|  251|
|110|2012-01-04|  242|
|110|2012-01-05|  254|
|110|2012-01-06|  254|
|110|2012-01-07|  256|
|110|2012-01-08|  256|
|110|2012-01-09|  289|
|110|2012-01-10|  276|
|110|2012-01-11|  259|
|110|2012-01-12|  280|
|110|2012-01-13|  290|
|100|2012-01-02|   25|
|100|2012-01-03|   26|
|100|2012-01-04|   24|
|100|2012-01-05|   26|
|100|2012-01-06|   26|
|100|2012-01-07|   25|
|100|2012-01-08|   28|
|100|2012-01-09|   22|
+---+----------+-----+

Answer 1 (score: 0)

Using when and otherwise will get you what you want.

You only need to change your code from:

win = Window.partitionBy("GR").orderBy("Date")\
                .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df1 = df1.withColumn("Count", last('Count', True).over(win))

To:

win = Window.partitionBy("GR").orderBy("Date")\
                .rowsBetween(Window.unboundedPreceding, -1)

df1 = df1.withColumn("Count_new", F.when(df1.Count==0, F.last('Count', True).over(win)).otherwise(df1.Count))

The output will be:

+-----+----------+---+---------+
|Count|      Date| GR|Count_new|
+-----+----------+---+---------+
|  250|2012-01-02|110|      250|
|  251|2012-01-03|110|      251|
|  242|2012-01-04|110|      242|
|  254|2012-01-05|110|      254|
|  254|2012-01-06|110|      254|
|  256|2012-01-07|110|      256|
|    0|2012-01-08|110|      256|
|  289|2012-01-09|110|      289|
|  276|2012-01-10|110|      276|
|  259|2012-01-11|110|      259|
|  280|2012-01-12|110|      280|
|  290|2012-01-13|110|      290|
|   25|2012-01-02|100|       25|
|   26|2012-01-03|100|       26|
|   24|2012-01-04|100|       24|
|   26|2012-01-05|100|       26|
|    0|2012-01-06|100|       26|
|   25|2012-01-07|100|       25|
|   28|2012-01-08|100|       28|
|   22|2012-01-09|100|       22|
+-----+----------+---+---------+

Answer 2 (score: 0)

Thank you both. In the meantime I tried the following, which is similar to Solat's idea of converting 0 to null. This is what I did, and it works well!

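from pyspark.sql import Window
from pyspark.sql.functions import col, isnan, last, when

# Keep only valid, positive counts; everything else (0, NaN, null) becomes null
df1 = pdf.withColumn("Count_NEW", \
                    when(~isnan("Count") & col("Count").isNotNull()\
                        & (col("Count") > 0), col("Count"))\
                               .otherwise(None) )
win = Window.partitionBy("GR").orderBy("Date")\
                .rowsBetween(Window.unboundedPreceding, Window.currentRow)
# Carry the last non-null value forward within each GR
df1 = df1.withColumn("Count_NEW", last('Count_NEW', True).over(win))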