I'm trying to get the previous value within a group using DataFrames in PySpark, but I can't get it to work when the group is made up of two columns (date and text):
window = Window.partitionBy("date", "text").orderBy("date", "text")
df2 = df2.withColumn('prev_count', func.lag(df2['count']).over(window))
Result:
+--------+----+-----+---+----------+
|    date|text|count|day|prev_count|
+--------+----+-----+---+----------+
|20180901| cat|    2|  1|      null|
|20180901| dog|    2|  1|      null|
|20180902| cat|    3|  2|      null|
|20180902| dog|    6|  2|      null|
|20180903| cat|    2|  3|      null|
|20180904| cat|    3|  4|      null|
|20180905| cat|    2|  5|      null|
|20180905| dog|    4|  5|      null|
+--------+----+-----+---+----------+
Desired output:
+--------+----+-----+---+----------+
|    date|text|count|day|prev_count|
+--------+----+-----+---+----------+
|20180901| cat|    2|  1|      null|
|20180901| dog|    2|  1|      null|
|20180902| cat|    3|  2|         2|
|20180902| dog|    6|  2|         2|
|20180903| cat|    2|  3|         3|
|20180904| cat|    3|  4|         2|
|20180905| cat|    2|  5|         3|
|20180905| dog|    4|  5|         6|
+--------+----+-----+---+----------+
The goal is to compare, for each text, the count on a given day with the count on the previous day.
Thanks.
Answer 0 (score: 0)
I think you should drop the "date" field from the partitionBy clause. The data is unique on the combination of "date" and "text", so each partition of that window contains exactly one row; lag() has no previous row to look back at, which is why every value comes back null.
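A quick way to see this (a sketch, assuming df2 is the DataFrame from the question): group by the same two columns and look for repeats. The result is empty, so every window partition holds a single row.

>>> # Sanity check: no (date, text) combination appears more than once,
>>> # so each partition of the original window contains exactly one row.
>>> df2.groupBy('date', 'text').count().filter('count > 1').show()

Partitioning by text alone gives lag() a real history to look back over: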
>>> from pyspark.sql.window import Window
>>> import pyspark.sql.functions as func
>>>
>>> data = sc.parallelize([
... ('20180901','cat',2,1),
... ('20180901','dog',2,1),
... ('20180902','cat',3,2),
... ('20180902','dog',6,2),
... ('20180903','cat',2,3),
... ('20180904','cat',3,4),
... ('20180905','cat',2,5),
... ('20180905','dog',4,5)])
>>>
>>> columns = ['date','text','count','day']
>>> df = spark.createDataFrame(data, columns)
>>>
>>> window = Window.partitionBy('text').orderBy('date','text')
>>> df = df.withColumn('prev_count', func.lag('count').over(window))
>>>
>>> df.sort('date','text').show()
+--------+----+-----+---+----------+
|    date|text|count|day|prev_count|
+--------+----+-----+---+----------+
|20180901| cat|    2|  1|      null|
|20180901| dog|    2|  1|      null|
|20180902| cat|    3|  2|         2|
|20180902| dog|    6|  2|         2|
|20180903| cat|    2|  3|         3|
|20180904| cat|    3|  4|         2|
|20180905| cat|    2|  5|         3|
|20180905| dog|    4|  5|         6|
+--------+----+-----+---+----------+
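From here, the comparison the question asks for (each day's count against the previous day's, per text) is one more column; a minimal sketch, with count_change as a purely illustrative name:

>>> # count_change is null on each text's first day, where prev_count is null
>>> df = df.withColumn('count_change', func.col('count') - func.col('prev_count'))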