Scala Spark: counting how many times a value changes over time in a DataFrame

Date: 2019-01-11 11:24:22

Tags: apache-spark apache-spark-sql

For example, I have a DataFrame like this:

val df = Seq(
  (10, "id1", 1),
  (20, "id1", 6),
  (30, "id1", 6),
  (40, "id1", 11),
  (50, "id1", 1),
  (60, "id1", 1),
  (70, "id1", 11),
  (10, "id2", 1),
  (20, "id2", 11),
  (30, "id2", 1),
  (40, "id2", 6),
  (50, "id2", 1),
  (60, "id2", 11),
  (70, "id2", 6)
).toDF("Time", "ID", "Channel")

+----+---+-------+
|Time| ID|Channel|
+----+---+-------+
|  10|id1|      1|
|  20|id1|      6|
|  30|id1|      6|
|  40|id1|     11|
|  50|id1|      1|
|  60|id1|      1|
|  70|id1|     11|
|  10|id2|      1|
|  20|id2|     11|
|  30|id2|      1|
|  40|id2|      6|
|  50|id2|      1|
|  60|id2|     11|
|  70|id2|      6|
+----+---+-------+

For each ID, I want to count the number of times Channel changes over time, to get a result like:

+---+-----------------------+
| ID|NumberChannelChangement|
+---+-----------------------+
|id1|                      4|
|id2|                      6|
+---+-----------------------+

I tried converting the DataFrame to an RDD and iterating over it, but with the same input I don't get the same result from one run to the next.
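The nondeterminism comes from iterating the RDD without an explicit ordering: rows arrive in whatever order the partitions produce them, so the "previous" channel seen during iteration varies between runs. The counting logic itself is simple once an ordering is fixed. A minimal plain-Scala sketch (names and values illustrative, taken from the example data above), assuming each ID's Channel values have already been sorted by Time:

```scala
// Plain-Scala sketch of the change-counting logic, assuming the Channel
// values for one ID are already sorted by Time. In Spark, that ordering must
// be enforced explicitly (e.g. a window ordered by Time); iterating an RDD
// gives no ordering guarantee, which is why repeated runs can disagree.
object ChangeCount {
  // Count adjacent pairs whose values differ.
  def countChanges(channels: Seq[Int]): Int =
    channels.sliding(2).count { case Seq(a, b) => a != b }

  def main(args: Array[String]): Unit = {
    val id1 = Seq(1, 6, 6, 11, 1, 1, 11) // id1's Channel values, ordered by Time
    val id2 = Seq(1, 11, 1, 6, 1, 11, 6) // id2's Channel values, ordered by Time
    println(countChanges(id1)) // 4
    println(countChanges(id2)) // 6
  }
}
```

The answers below apply the same idea inside Spark by making the ordering explicit with a window function.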

Thanks in advance for your help.

2 answers:

Answer 0 (score: 1):

You can combine the analytic function `lag` to detect the changes with a `groupBy` to count them:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df
  .withColumn("lag_Channel", lag($"Channel", 1).over(Window.partitionBy($"ID").orderBy($"Time")))
  .withColumn("change", coalesce($"Channel" =!= $"lag_Channel", lit(false)))
  .groupBy($"ID")
  .agg(
    sum(when($"change", lit(1))).as("NumberChannelChangement")
  )
  .show()

+---+-----------------------+
| ID|NumberChannelChangement|
+---+-----------------------+
|id1|                      4|
|id2|                      6|
+---+-----------------------+
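One caveat with the `sum(when(...))` aggregate: for an ID whose channel never changes, no row satisfies the condition, so the sum is null rather than 0 (a sum over no values is null in Spark SQL). A hedged variant (same logic, sketched under the same imports, not a correction of the answer above) uses `count`, which skips nulls and therefore yields 0 for such an ID:

```scala
// Variant sketch: count(when(cond, true)) counts only the rows where the
// condition holds, so an ID with zero changes reports 0 instead of null.
df
  .withColumn("prev", lag($"Channel", 1).over(Window.partitionBy($"ID").orderBy($"Time")))
  .groupBy($"ID")
  .agg(count(when($"Channel" =!= $"prev", true)).as("NumberChannelChangement"))
  .show()
```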

Answer 1 (score: 1):

Using Spark SQL:

df.createOrReplaceTempView("pierrek")
spark.sql(
  """ with t1 as (select time, id, channel, lag(channel) over(partition by id order by time) chn_lag from pierrek)
      select id, sum(case when chn_lag is null then 0 when channel = chn_lag then 0 else 1 end) as NumberChannelChangement from t1 group by id
  """).show(false)

Result:

+---+-----------------------+
|id |NumberChannelChangement|
+---+-----------------------+
|id1|4                      |
|id2|6                      |
+---+-----------------------+