Question

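I'm trying to get the previous value within the same group using DataFrames and PySpark, but I can't get it to work when the group consists of two columns (date and text).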

window = Window.partitionBy("date", "text").orderBy("date", "text")
df2 = df2.withColumn('prev_date', func.lag(df2['count']).over(window))

Result:

+--------+----+-----+---+----------+
|    date|text|count|day|prev_count|
+--------+----+-----+---+----------+
|20180901| cat|    2|  1|      null|
|20180901| dog|    2|  1|      null|
|20180902| cat|    3|  2|      null|
|20180902| dog|    6|  2|      null|
|20180903| cat|    2|  3|      null|
|20180904| cat|    3|  4|      null|
|20180905| cat|    2|  5|      null|
|20180905| dog|    4|  5|      null|
+--------+----+-----+---+----------+

Desired output:

+--------+----+-----+---+----------+
|    date|text|count|day|prev_count|
+--------+----+-----+---+----------+
|20180901| cat|    2|  1|      null|
|20180901| dog|    2|  1|      null|
|20180902| cat|    3|  2|         2|
|20180902| dog|    6|  2|         2|
|20180903| cat|    2|  3|         3|
|20180904| cat|    3|  4|         2|
|20180905| cat|    2|  5|         3|
|20180905| dog|    4|  5|         6|
+--------+----+-----+---+----------+

The goal is to compare the count for a text on one day with its count on the previous day on which that text appears.

Thanks.

Answer 1

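I think you should remove the "date" field from the partitionBy clause. Every ("date", "text") combination in your data is unique, so each partition contains exactly one row and lag has no earlier row to look at. That is why every value comes back as null. Partitioning by "text" alone and ordering by "date" gives the output you want: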

>>> from pyspark.sql.window import Window
>>> import pyspark.sql.functions as func
>>> 
>>> data = sc.parallelize([
...     ('20180901','cat',2,1),
...     ('20180901','dog',2,1),
...     ('20180902','cat',3,2),
...     ('20180902','dog',6,2),
...     ('20180903','cat',2,3),
...     ('20180904','cat',3,4),
...     ('20180905','cat',2,5),
...     ('20180905','dog',4,5)])
>>> 
>>> columns = ['date','text','count','day']
>>> df = spark.createDataFrame(data, columns)
>>> 
>>> window = Window.partitionBy('text').orderBy('date','text')
>>> df = df.withColumn('prev_date', func.lag('count').over(window))
>>> 
>>> df.sort('date','text').show()
+--------+----+-----+---+---------+
|    date|text|count|day|prev_date|
+--------+----+-----+---+---------+
|20180901| cat|    2|  1|     null|
|20180901| dog|    2|  1|     null|
|20180902| cat|    3|  2|        2|
|20180902| dog|    6|  2|        2|
|20180903| cat|    2|  3|        3|
|20180904| cat|    3|  4|        2|
|20180905| cat|    2|  5|        3|
|20180905| dog|    4|  5|        6|
+--------+----+-----+---+---------+
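
If it helps to see why the partition key matters without starting a Spark session, here is a minimal plain-Python sketch of what lag over a partitioned, date-ordered window does. It is only an emulation for illustration: `lag_count` is a hypothetical helper, not a PySpark function, and it assumes the dates are `YYYYMMDD` strings so that string ordering matches date ordering.

```python
from collections import defaultdict

# Rows mirror the question's data: (date, text, count, day).
rows = [
    ("20180901", "cat", 2, 1),
    ("20180901", "dog", 2, 1),
    ("20180902", "cat", 3, 2),
    ("20180902", "dog", 6, 2),
    ("20180903", "cat", 2, 3),
    ("20180904", "cat", 3, 4),
    ("20180905", "cat", 2, 5),
    ("20180905", "dog", 4, 5),
]


def lag_count(rows, partition_key):
    """Emulate lag('count') over a window partitioned by partition_key, ordered by date.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    # Group the rows by the partition key, as Window.partitionBy would.
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)

    result = []
    for group in partitions.values():
        group.sort(key=lambda r: r[0])  # order by date inside the partition
        previous = None  # the first row of a partition has no predecessor
        for row in group:
            result.append(row + (previous,))
            previous = row[2]  # remember this row's count for the next row
    return sorted(result)  # by date, then text, like df.sort('date', 'text')


# Partition by (date, text): every partition holds a single row.
by_date_and_text = lag_count(rows, lambda r: (r[0], r[1]))
# Partition by text only: each text forms one date-ordered sequence.
by_text = lag_count(rows, lambda r: r[1])

print([r[-1] for r in by_date_and_text])
print([r[-1] for r in by_text])
```

With both columns in the key, the first list contains only `None`, which corresponds to the all-null column in the question. With `text` alone as the key, the second list is `[None, None, 2, 2, 3, 2, 3, 6]`, the same values as the desired output.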
