I'm new to Scala and Spark, and I don't know how to explode the "path" field and find the max and min of the "event_dttm" field in a single pass. I have this data:
val weblog = sc.parallelize(Seq(
  ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424954530, "SO.01"),
  ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424964830, "SO.01"),
  ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424978445, "SO.01")
)).toDF("id", "domain", "path", "event_dttm", "load_src")
I need to get the following result:
"id" | "domain" |"newPath" | "max_time" | min_time | "load_src"
39f0412b4c91|staticnavi.com| panel | 1424978445 | 1424954530 | SO.01
39f0412b4c91|staticnavi.com| cm.html | 1424978445 | 1424954530 | SO.01
I think this can be achieved with window functions, but I don't know how.
Answer 0 (score: 1)
You are looking for explode(), followed by a groupBy aggregation:
import org.apache.spark.sql.functions.{explode, min, max}

// explode turns each element of the "path" array into its own row,
// then we group and take min/max of event_dttm per group
val result = weblog
  .withColumn("path", explode($"path"))
  .groupBy("id", "domain", "path", "load_src")
  .agg(
    min($"event_dttm").as("min_time"),
    max($"event_dttm").as("max_time"))

result.show()
+------------+--------------+-------+--------+----------+----------+
| id| domain| path|load_src| min_time| max_time|
+------------+--------------+-------+--------+----------+----------+
|39f0412b4c91|staticnavi.com| panel| SO.01|1424954530|1424978445|
|39f0412b4c91|staticnavi.com|cm.html| SO.01|1424954530|1424978445|
+------------+--------------+-------+--------+----------+----------+
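Since you mentioned window functions: the same result can also be sketched with explode() plus min/max over a window, which keeps every exploded row and attaches the group-wide extremes to each one. This is a hedged alternative, not necessarily better than groupBy here (the window partitions by id/domain/load_src, matching your expected output where both paths share the same min/max); column names are the ones from your example.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{explode, min, max, col}

// One partition per (id, domain, load_src) group; min/max are
// computed over the whole partition, not per exploded path
val w = Window.partitionBy("id", "domain", "load_src")

val resultWindow = weblog
  .withColumn("newPath", explode(col("path")))
  .withColumn("min_time", min(col("event_dttm")).over(w))
  .withColumn("max_time", max(col("event_dttm")).over(w))
  .drop("path", "event_dttm")
  .distinct() // collapse the three source rows per path into one

resultWindow.show()
```

Note that an unordered window like this makes Spark compute the aggregate over the entire partition, so each row sees the global min and max for its group; distinct() is then needed because the three input rows per path become identical after dropping event_dttm.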