For example, I have a DataFrame like this:
val DF = Seq((10, "id1",1),
(20, "id1",6),
(30, "id1",6),
(40, "id1",11),
(50, "id1",1),
(60, "id1",1),
(70, "id1",11),
(10, "id2",1),
(20, "id2",11),
(30, "id2",1),
(40, "id2",6),
(50, "id2",1),
(60, "id2",11),
(70, "id2",6)).toDF("Time", "ID","Channel")
+----+---+-------+
|Time| ID|Channel|
+----+---+-------+
| 10|id1| 1|
| 20|id1| 6|
| 30|id1| 6|
| 40|id1| 11|
| 50|id1| 1|
| 60|id1| 1|
| 70|id1| 11|
| 10|id2| 1|
| 20|id2| 11|
| 30|id2| 1|
| 40|id2| 6|
| 50|id2| 1|
| 60|id2| 11|
| 70|id2| 6|
+----+---+-------+
I want to count, for each ID, the number of times Channel changes over time, and get a result like this:
+---+-----------------------+
| ID|NumberChannelChangement|
+---+-----------------------+
|id1| 4|
|id2| 6|
+---+-----------------------+
I tried converting the DataFrame to an RDD and iterating over it, but with the same input I don't get the same result from one run to the next.
Thanks in advance for your help.
Answer 0 (score: 1)
You can combine the analytic function lag to detect the changes with a groupBy to count them:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
DF
  // previous Channel value within each ID, ordered by Time
  .withColumn("lag_Channel", lag($"Channel", 1).over(Window.partitionBy($"ID").orderBy($"Time")))
  // a change happened when the current Channel differs from the previous one;
  // the first row per ID has a null lag, which coalesce turns into false
  .withColumn("change", coalesce($"Channel" =!= $"lag_Channel", lit(false)))
  .groupBy($"ID")
  .agg(
    sum(when($"change", lit(1))).as("NumberChannelChangement")
  )
  .show()
+---+-----------------------+
| ID|NumberChannelChangement|
+---+-----------------------+
|id1| 4|
|id2| 6|
+---+-----------------------+
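One edge case worth noting (my observation, not part of the original answer): if an ID never changes channel, sum(when($"change", lit(1))) aggregates only nulls and returns null rather than 0. A minimal variant that coalesces the aggregate, assuming the same DF and imports as above:
DF
  .withColumn("lag_Channel", lag($"Channel", 1).over(Window.partitionBy($"ID").orderBy($"Time")))
  .withColumn("change", coalesce($"Channel" =!= $"lag_Channel", lit(false)))
  .groupBy($"ID")
  .agg(
    // coalesce so that an ID with zero changes reports 0 instead of null
    coalesce(sum(when($"change", lit(1))), lit(0)).as("NumberChannelChangement")
  )
  .show()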
Answer 1 (score: 1)
Using Spark SQL:
df.createOrReplaceTempView("PierreK ")
spark.sql(
""" with t1 (select time,id, channel, lag(channel) over(partition by id order by time) chn_lag from pierrek)
select id, sum( case when chn_lag is null then 0 when channel=chn_lag then 0 else 1 end) as NumberChannelChangement from t1 group by id
""").show(false)
Result:
+---+-----------------------+
|id |NumberChannelChangement|
+---+-----------------------+
|id1|4 |
|id2|6 |
+---+-----------------------+
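For what it's worth, both answers can be condensed by counting instead of summing: count skips nulls, so neither the first-row null nor the no-change case needs explicit handling. A minimal sketch (my variant, not from either answer), assuming the same DF and imports as in answer 0:
val byIdOverTime = Window.partitionBy($"ID").orderBy($"Time")
DF
  // null on the first row per ID, false when unchanged, true on a change
  .withColumn("changed", $"Channel" =!= lag($"Channel", 1).over(byIdOverTime))
  .groupBy($"ID")
  // when(...) without otherwise yields null for false/null rows, and count skips nulls
  .agg(count(when($"changed", lit(true))).as("NumberChannelChangement"))
  .show()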