How to explode an Array[String] field and group the data in a single pass

Asked: 2017-04-05 18:03:13

Tags: scala apache-spark

I am new to Scala and Spark, and I don't know how to explode the "path" field and find the maximum and minimum of the "event_dttm" field in a single pass. I have this data:

import spark.implicits._  // needed for toDF outside of spark-shell

val weblog = sc.parallelize(Seq(
  ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424954530, "SO.01"),
  ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424964830, "SO.01"),
  ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424978445, "SO.01")
)).toDF("id", "domain", "path", "event_dttm", "load_src")
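As a quick sanity check (not part of the original question, just a sketch against the weblog DataFrame above), you can confirm that "path" was inferred as an array column, which is what explode() expects:

// Verify the inferred schema; "path" should show up as array<string>.
weblog.printSchema()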

I need to get the following result:

"id"        |   "domain"   |"newPath" | "max_time" | min_time   | "load_src"
39f0412b4c91|staticnavi.com|  panel   | 1424978445 | 1424954530 | SO.01
39f0412b4c91|staticnavi.com|  cm.html | 1424978445 | 1424954530 | SO.01

I think this can be done with row functions, but I don't know how.

1 Answer:

Answer 0 (score: 1)

You are looking for explode() followed by a groupBy aggregation:

import org.apache.spark.sql.functions.{explode, min, max}

// explode() turns each element of the "path" array into its own row;
// groupBy + agg then computes the min and max event time per group.
val result = weblog.withColumn("path", explode($"path"))
  .groupBy("id", "domain", "path", "load_src")
  .agg(min($"event_dttm").as("min_time"),
       max($"event_dttm").as("max_time"))

result.show()
+------------+--------------+-------+--------+----------+----------+
|          id|        domain|   path|load_src|  min_time|  max_time|
+------------+--------------+-------+--------+----------+----------+
|39f0412b4c91|staticnavi.com|  panel|   SO.01|1424954530|1424978445|
|39f0412b4c91|staticnavi.com|cm.html|   SO.01|1424954530|1424978445|
+------------+--------------+-------+--------+----------+----------+
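If you also want the column name and order to match the desired output exactly, a small follow-up sketch (assuming the result DataFrame above) is to rename the exploded column and reorder the columns with a select:

// Rename "path" to "newPath" and put the columns in the requested order.
val formatted = result
  .withColumnRenamed("path", "newPath")
  .select("id", "domain", "newPath", "max_time", "min_time", "load_src")

formatted.show()

Using withColumnRenamed keeps the aggregation untouched; alternatively, you could alias the column directly inside the select with $"path".as("newPath").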