Question

I have the following schema:

root
 |-- Id: string (nullable = true)
 |-- Desc: string (nullable = true)
 |-- Measurements: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- time: string (nullable = true)
 |    |    |-- metric: string (nullable = true)
 |    |    |-- value: string (nullable = true)

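For my analysis I'd like to keep the nested structure as is, but add columns to the DataFrame containing the number of elements in Measurements and the min/max/avg of certain fields, specifically of value for a given metric, e.g. 'temperature'.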

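With a SQLContext I can simply use sqlContext.sql("SELECT Id, SIZE(Measurements) AS num_entries FROM df") to get the size, but I'm wondering whether there is an elegant way (in Scala) to do what I want, i.e. without creating a new DataFrame that has to be re-joined on Id?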

Answer 1

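There is no universal approach here. Simple metrics, such as the number of elements in the array, can be extracted easily with built-in functions (size).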

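import org.apache.spark.sql.functions._ // size, explode, avg, first, udf
import sqlContext.implicits._           // toDF and the $"..." column syntax

case class Measurement(temperature: Double, speed: Double)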

val df = sc.parallelize(Seq(
  (1L, Array(Measurement(0.5, 10.0), Measurement(6.2, 3.7))),
  (2L, Array(Measurement(22.0, 5.0)))
)).toDF("id", "measurements")

df.select($"*", size($"measurements")).show

// +---+--------------------+------------------+
// | id|        measurements|size(measurements)|
// +---+--------------------+------------------+
// |  1|[[0.5,10.0], [6.2...|                 2|
// |  2|        [[22.0,5.0]]|                 1|
// +---+--------------------+------------------+

Anything more complex requires exploding:

val expanded = df.withColumn("measurement",explode($"measurements"))
val withStats = expanded
 .groupBy($"id")
 .agg(
   avg($"measurement.temperature").alias("avg_temp"),
   avg($"measurement.speed").alias("avg_speed"),
   first($"measurements")) // This assumes a single row per ID!

withStats.show
// +---+--------+---------+---------------------+
// | id|avg_temp|avg_speed|first(measurements)()|
// +---+--------+---------+---------------------+
// |  1|    3.35|     6.85| [[0.5,10.0], [6.2...|
// |  2|    22.0|      5.0|         [[22.0,5.0]]|
// +---+--------+---------+---------------------+
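
Applied to the schema from the question, where every element carries a metric name and a string value, the same explode-and-aggregate pattern might look like the sketch below. It assumes a Spark 1.5/1.6 spark-shell with sc and sqlContext predefined (on newer versions spark.implicits._ takes the place of sqlContext.implicits._). The Reading case class, the sample rows and the alias names are hypothetical and only there to make the snippet self-contained.

```scala
// Run in spark-shell, where sc and sqlContext are predefined (assumption).
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Hypothetical sample rows shaped like the schema in the question.
case class Reading(time: String, metric: String, value: String)

val nested = sc.parallelize(Seq(
  ("a", "first", Array(Reading("t0", "temperature", "20.0"),
                       Reading("t1", "temperature", "24.0"),
                       Reading("t1", "speed", "3.5"))),
  ("b", "second", Array(Reading("t0", "temperature", "10.0")))
)).toDF("Id", "Desc", "Measurements")

val tempStats = nested
  .withColumn("m", explode($"Measurements"))
  .where($"m.metric" === "temperature")            // keep only the metric of interest
  .groupBy($"Id")
  .agg(
    count($"m.value").alias("num_temp"),
    min($"m.value".cast("double")).alias("min_temp"),  // value is a string in the schema
    max($"m.value".cast("double")).alias("max_temp"),
    avg($"m.value".cast("double")).alias("avg_temp"),
    first($"Measurements").alias("Measurements"))  // assumes a single row per Id

tempStats.show
```

Two caveats apply to this sketch: an Id without any 'temperature' entry drops out of the result because of the filter, and values that do not parse as numbers become null after the cast and are ignored by the aggregates.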

or a UDF (something you want to avoid in PySpark):

import scala.util.Try
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
def my_mean(c: String) = udf((xs: Seq[Row]) => 
   Try(xs.map(_.getAs[Double](c)).sum / xs.size).toOption
)

val withAvgTemp = df.withColumn(
  "avg_temperature", my_mean("temperature")($"measurements"))

withAvgTemp.show
// +---+--------------------+---------------+
// | id|        measurements|avg_temperature|
// +---+--------------------+---------------+
// |  1|[[0.5,10.0], [6.2...|           3.35|
// |  2|        [[22.0,5.0]]|           22.0|
// +---+--------------------+---------------+
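
For the schema in the question, the per-array logic that such a UDF would have to apply (select one metric, parse value, summarise) can be sketched in plain Scala without Spark. Reading, Stats and statsFor are hypothetical names introduced only for this illustration; inside an actual UDF the array would arrive as Seq[Row], as in my_mean above, so the field access would differ.

```scala
import scala.util.Try

// Hypothetical mirror of one element of the Measurements array in the question.
case class Reading(time: String, metric: String, value: String)

// Hypothetical result type, built only when at least one value was found.
case class Stats(count: Int, min: Double, max: Double, avg: Double)

// Summarise the values recorded for one metric. Entries whose value
// does not parse as a Double are skipped.
def statsFor(metric: String, rs: Seq[Reading]): Option[Stats] = {
  val values = rs
    .filter(_.metric == metric)
    .flatMap(r => Try(r.value.toDouble).toOption)
  if (values.isEmpty) None
  else Some(Stats(values.size, values.min, values.max, values.sum / values.size))
}

val sample = Seq(
  Reading("t0", "temperature", "20.0"),
  Reading("t1", "temperature", "24.0"),
  Reading("t1", "speed", "3.5"),
  Reading("t2", "temperature", "n/a")) // unparsable, so it is skipped

println(statsFor("temperature", sample)) // Some(Stats(2,20.0,24.0,22.0))
println(statsFor("pressure", sample))    // None
```

Returning an Option keeps the "no data for this metric" case explicit instead of producing NaN from a division by zero.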

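You can also try Spark Datasets, but these are still far from stable.

In general, nested structures are mostly useful for imports (and optionally exports); otherwise they are second-class citizens.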

Note (Spark < 1.5):

If you use an older version of Spark, you can do some of the above with selectExpr (requires HiveContext):

df.selectExpr("id", "size(measurements) AS n")
df.selectExpr("id", "explode(measurements) AS measurement")

Answer 2

import org.apache.spark.sql.functions._
df.select(df("id"), size(df("Measurements"))).collect

The above should work. For more built-in functions, see https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/sql/functions.html
