The first function in Spark

Time: 2018-11-15 20:37:46

Tags: apache-spark dataframe apache-spark-sql

I am not sure why first("traitvalue") works in the query that produces the output DataFrame below. What does first("traitvalue") mean here? Please advise.

Input DataFrame:

   val df = sc.parallelize(List(("1","NA","action","Heavy", "NY"),("1","NA","comedy","light", "NY"),("1","NA","horror","light", "NY"),("1","NA","horror","light", "KY"),("2","NA","horror","light", "NY"))).toDF("ban","yr_mon","genre","traitvalue","state")

+---+------+------+----------+-----+
|ban|yr_mon| genre|traitvalue|state|
+---+------+------+----------+-----+
|  1|    NA|action|     Heavy|   NY|
|  1|    NA|comedy|     light|   NY|
|  1|    NA|horror|     light|   NY|
|  1|    NA|horror|     light|   KY|
|  2|    NA|horror|     light|   NY|
+---+------+------+----------+-----+

Output DataFrame:

df.groupBy($"ban",$"state").pivot("genre").agg(first("traitvalue")).show


+---+-----+------+------+------+
|ban|state|action|comedy|horror|
+---+-----+------+------+------+
|  2|   NY|  null|  null| light|
|  1|   NY| Heavy| light| light|
|  1|   KY|  null|  null| light|
+---+-----+------+------+------+

1 answer:

Answer 0 (score: 0)

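It is just a trick, because the example uses agg with pivot on a column that is not numeric. pivot has to be followed by an aggregate function, and traitvalue is a string, so a numeric aggregate such as sum does not apply. With categorical values you could in theory get several entries for the same (ban, state, genre) cell, for example two traits, so first simply takes the first such entry. Normally there is no such issue, because each cell holds at most one value. Hence this approach.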
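To make the "several entries per cell" case concrete, here is a minimal sketch, not part of the original answer. It reuses the question's rows and adds one hypothetical extra row, so that the cell (ban = 1, state = NY, genre = horror) receives two different trait values. The object name PivotFirstSketch, the helper sampleData, and the local SparkSession settings are assumptions made for illustration. Swapping first for count or collect_list shows how many values, and which ones, compete for each cell.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{collect_list, count, first}

object PivotFirstSketch {

  // The question's rows plus ONE hypothetical extra row (marked below), so that
  // the cell (ban = 1, state = NY, genre = horror) gets two different values.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("1", "NA", "action", "Heavy", "NY"),
      ("1", "NA", "comedy", "light", "NY"),
      ("1", "NA", "horror", "light", "NY"),
      ("1", "NA", "horror", "Heavy", "NY"), // hypothetical row, not in the question
      ("1", "NA", "horror", "light", "KY"),
      ("2", "NA", "horror", "light", "NY")
    ).toDF("ban", "yr_mon", "genre", "traitvalue", "state")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are arbitrary.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-first-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = sampleData(spark)
    val grouped = df.groupBy($"ban", $"state").pivot("genre")

    // How many rows fall into each pivoted cell.
    grouped.agg(count("traitvalue")).show()

    // Every value that falls into each cell, collected into a list.
    grouped.agg(collect_list("traitvalue")).show(truncate = false)

    // first keeps a single value per cell and silently drops the rest.
    grouped.agg(first("traitvalue")).show()

    spark.stop()
  }
}
```

When a cell holds more than one row, which value first returns depends on the order in which rows arrive, and that order is not guaranteed after a shuffle. If the choice matters, a deterministic aggregate such as min or max on the string column, or collect_list to keep every value, may be a safer option.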