Question

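The title of this question is not very clear, but I don't know how to phrase it better. I have a dataframe that says, for each minute, whether there is a queue at a skilift.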

The dataframe contains 2 columns:
- Minute = the minute we are looking at
- Queue = 1 if there is a queue at that minute, 0 otherwise

Example:

from 08h00 to 10h00 each line gets a 0 in "Queue"
from 10h01 to 10h45 each line gets a 1 in "Queue"
from 10h46 to 14h00 each line gets a 0 in "Queue"
from 14h01 to 14h45 each line gets a 1 in "Queue"
from 14h46 to 17h30 each line gets a 0 in "Queue"

I would like to create a new dataframe with 2 columns:

----------------------
Start      |    End
----------------------
10h01      |   10h45
14h01      |   14h45

I managed to get a dataframe like this:

----------------------
Start      |    End
----------------------
10h01      |   None
None       |   10h45
14h01      |   None
None       |   14h45

Using:

df2=df.withColumn('start', F.when((F.col("Prev_Queue") == 0) & (F.col("Queue") == 1), F.col('NextMin')).otherwise(None))

df2=df2.withColumn('end', F.when((F.col("Next_Queue") == 0) & (F.col("Queue") == 1), F.col('NextMin')).otherwise(None))

where "Prev_Queue" is the value of Queue at the previous minute and "Next_Queue" is the value of Queue at the next minute.

Any ideas on how to get the dataframe I want (either from the one I managed to get, or in a simpler way)? Thanks in advance :-)

Answer 1

I got help from a colleague ;-)

For information, I also have a column "skilift" that contains the name of the skilift I am interested in.

Here is the solution:

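from pyspark.sql import Window, functions as F

# rnk flags each minute where Queue differs from the previous minute;
# its running sum rnk2 gives every uninterrupted run of equal values its own id
w = Window.partitionBy('skilift').orderBy('Minute')
df = df.withColumn('rnk', F.when(F.lag('Queue').over(w) != F.col('Queue'), 1).otherwise(0))\
    .withColumn('rnk2', F.sum('rnk').over(w))

# keep only the minutes with a queue, then take the first and last minute of each run
df.where('Queue = 1').groupBy('skilift', 'rnk2').agg(F.min('Minute').alias('Start'), F.max('Minute').alias('End')).drop('rnk2').show(truncate=False)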
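
To see why the two helper columns isolate each queue period, here is a small plain-Python sketch of the same logic (no Spark involved). The sample rows are made up for illustration and assume a single skilift already sorted by minute; `rnk` and `rnk2` simply mirror the column names used above.

```python
from itertools import accumulate

# Hypothetical sample: (minute, queue) pairs for one skilift, sorted by minute.
rows = [("10h00", 0), ("10h01", 1), ("10h02", 1), ("10h03", 0),
        ("14h00", 0), ("14h01", 1), ("14h02", 1), ("14h03", 0)]

queue = [q for _, q in rows]

# rnk: 1 where Queue differs from the previous row, else 0.
# The first row has no predecessor and gets 0; in Spark, lag() returns null
# there, the comparison is null, and when() falls through to otherwise(0).
rnk = [0] + [int(cur != prev) for prev, cur in zip(queue, queue[1:])]

# rnk2: running sum of rnk. All rows of one uninterrupted run share a value.
rnk2 = list(accumulate(rnk))

# Keep only rows with Queue == 1, then track min/max minute per run id.
periods = {}
for (minute, q), run_id in zip(rows, rnk2):
    if q == 1:
        start, end = periods.get(run_id, (minute, minute))
        periods[run_id] = (min(start, minute), max(end, minute))

result = [periods[run_id] for run_id in sorted(periods)]

print(rnk)     # change flags
print(rnk2)    # run ids
print(result)  # one (start, end) pair per queue period
```

Each change in Queue bumps the running sum, so the rows of one queue period end up with the same `rnk2` and can be grouped together. Note that the string comparison here only works because the sample minutes are zero-padded; with real data a timestamp type would be the safer choice.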

Get the date of the first/last occurrence of a series
