Computing the mean and standard deviation with Spark/Scala

Date: 2020-03-24 17:19:16

Tags: scala apache-spark

I'm new to Scala and I'm not sure how to phrase this kind of question (I lack the technical vocabulary). I have a DataFrame:

id     VehicleID         Longitude    Latitude     Date         Distance
 1       12311            55.55431     25.45631     01/02/2020    20
 2       12311            55.55432     25.45634     01/02/2020    80
 3       12311            55.55433     25.45637     02/02/2020    10
 4       12311            55.55431     25.45621     02/02/2020    50
 5       12309            55.55427     25.45627     01/02/2020    30
 6       12309            55.55436     25.45655     02/02/2020    20
 7       12412            55.55441     25.45657     01/02/2020    14
 8       12412            55.55442     25.45656     02/02/2020    60

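I want to compute the mean and standard deviation for each block, that is, for each combination of VehicleID and Date. For example: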

VehicleID         Longitude    Latitude     Date         Distance   Mean
12311            55.55431     25.45631     01/02/2020    20          -
12311            55.55432     25.45634     01/02/2020    80          -

VehicleID         Longitude    Latitude     Date         Distance   Mean
 12311            55.55433     25.45637     02/02/2020    10
 12311            55.55431     25.45621     02/02/2020    50


VehicleID         Longitude    Latitude     Date         Distance   Mean
 12309            55.55427     25.45627     01/02/2020    30         -


 VehicleID         Longitude    Latitude     Date         Distance   Mean
 12309            55.55436     25.45655     02/02/2020    20          -

The same goes for the standard deviation.

I tried the following, but it doesn't work for me:

  val w = Window.partitionBy("vehicle_id", "Date").orderBy("id")
  val m =  dataframe_final.withColumn("mean",col("Distance").over(w).cast("double")).as[Double].rdd.mean()

How can I do this?

Thanks.

1 answer:

Answer 0 (score: 0)

You can do this with just groupBy, then join the aggregates back onto the original DataFrame:

val groupedMS = df.groupBy("VehicleID","Date")
  .agg(("Distance", "mean"),("Distance", "stddev"))

df.join(groupedMS, Seq("VehicleID","Date"))

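This gives the result below. The standard deviation is NaN for groups that contain a single row, because the sample standard deviation divides by n - 1 and is therefore undefined for one value: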

+---------+----------+---+---------+--------+--------+-------------+------------------+
|VehicleID|      Date| id|Longitude|Latitude|Distance|avg(Distance)|  stddev(Distance)|
+---------+----------+---+---------+--------+--------+-------------+------------------+
|    12311|01/02/2020|  1| 55.55431|25.45631|      20|         50.0| 42.42640687119285|
|    12311|01/02/2020|  2| 55.55432|25.45634|      80|         50.0| 42.42640687119285|
|    12311|02/02/2020|  3| 55.55433|25.45637|      10|         30.0|28.284271247461902|
|    12311|02/02/2020|  4| 55.55431|25.45621|      50|         30.0|28.284271247461902|
|    12309|01/02/2020|  5| 55.55427|25.45627|      30|         30.0|               NaN|
|    12309|02/02/2020|  6| 55.55436|25.45655|      20|         20.0|               NaN|
|    12412|01/02/2020|  7| 55.55441|25.45657|      14|         14.0|               NaN|
|    12412|02/02/2020|  8| 55.55442|25.45656|      60|         60.0|               NaN|
+---------+----------+---+---------+--------+--------+-------------+------------------+
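Since the question started from a window, here is a minimal sketch of the same result using window aggregates instead of groupBy plus join. It is not part of the original answer. The object name `GroupStats`, the helper `withGroupStats`, the output column names `mean`, `stddev` and `stddev_pop`, and the local SparkSession are illustrative assumptions. It also assumes the column is called `VehicleID` as in the table shown (the attempt in the question partitions by `vehicle_id`). The attempt most likely fails because `col("Distance").over(w)` applies a plain column over the window rather than an aggregate such as `avg`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, stddev_pop, stddev_samp}

object GroupStats {
  // Adds per-(VehicleID, Date) statistics of Distance to every row.
  def withGroupStats(df: DataFrame): DataFrame = {
    // No orderBy here: with an ordering, the default frame becomes a
    // running frame, so the aggregates would be cumulative per row.
    val w = Window.partitionBy("VehicleID", "Date")

    df.withColumn("mean", avg(col("Distance")).over(w))
      // Sample standard deviation (n - 1 denominator), same as "stddev".
      .withColumn("stddev", stddev_samp(col("Distance")).over(w))
      // Population standard deviation (n denominator).
      .withColumn("stddev_pop", stddev_pop(col("Distance")).over(w))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("group-stats-sketch")
      .getOrCreate()
    import spark.implicits._

    // The data from the question.
    val df = Seq(
      (1, 12311, 55.55431, 25.45631, "01/02/2020", 20),
      (2, 12311, 55.55432, 25.45634, "01/02/2020", 80),
      (3, 12311, 55.55433, 25.45637, "02/02/2020", 10),
      (4, 12311, 55.55431, 25.45621, "02/02/2020", 50),
      (5, 12309, 55.55427, 25.45627, "01/02/2020", 30),
      (6, 12309, 55.55436, 25.45655, "02/02/2020", 20),
      (7, 12412, 55.55441, 25.45657, "01/02/2020", 14),
      (8, 12412, 55.55442, 25.45656, "02/02/2020", 60)
    ).toDF("id", "VehicleID", "Longitude", "Latitude", "Date", "Distance")

    withGroupStats(df).orderBy("id").show(false)

    spark.stop()
  }
}
```

If a number is preferred over NaN for single-row groups, the population standard deviation (`stddev_pop`) is defined for one value and evaluates to 0.0 there. Which of the two is appropriate depends on whether each group is treated as a sample or as the whole population. Newer Spark versions may also report the undefined sample value as null rather than NaN, so it is worth checking the behaviour on your version.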