SparkSQL: How to select a column value based on a column name

Date: 2018-07-01 18:55:02

Tags: scala apache-spark apache-spark-sql

I am working with a dataframe that has the following schema:

root
 |-- Id: integer (nullable = true)
 |-- defectiveItem: string (nullable = true)
 |-- item: struct (nullable = true)
 |    |-- gem1: integer (nullable = true)
 |    |-- gem2: integer (nullable = true)
 |    |-- gem3: integer (nullable = true)

The defectiveItem column contains one of the values gem1, gem2, or gem3, while item holds the count for each item. Now, depending on the defective item, I need to project the count for the given defective item from item into a new column named count.

For example, if the defectiveItem column contains gem1 and item contains {"gem1":3,"gem2":4,"gem3":5}, the resulting count column should contain 3.

The resulting schema should look like this:

root
 |-- Id: integer (nullable = true)
 |-- defectiveItem: string (nullable = true)
 |-- item: struct (nullable = true)
 |    |-- gem1: integer (nullable = true)
 |    |-- gem2: integer (nullable = true)
 |    |-- gem3: integer (nullable = true)
 |-- count: integer (nullable = true)
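
For anyone who wants to reproduce this, here is a minimal sketch of input data matching the original schema, written for a spark-shell session; the case class Gems, the variable df, and the second row are my own illustrative assumptions, not part of the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("gems").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical case class mirroring the item struct in the schema above
case class Gems(gem1: Int, gem2: Int, gem3: Int)

// The first row reproduces the question's example: defectiveItem "gem1" should yield count 3
val df = Seq(
  (1, "gem1", Gems(3, 4, 5)),
  (2, "gem2", Gems(1, 7, 2))
).toDF("Id", "defectiveItem", "item")

df.printSchema() // mirrors the question's input schema (modulo nullability)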

2 Answers:

Answer 0 (Score: 1)

You can use a udf function to get the desired output dataframe:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

// Look up the struct field named by the defectiveItem value on each row
def getItemUdf = udf((defectItem: String, item: Row) => item.getAs[Int](defectItem))

df.withColumn("count", getItemUdf(col("defectiveItem"), col("item"))).show(false)
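
Because the udf receives the struct as a generic Row and looks the field up by name, it keeps working if more gem fields are added later. The trade-off, as the second answer notes, is that a udf is a black box to Catalyst, so Spark cannot optimize through it.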

I hope the answer is helpful.

Answer 1 (Score: 0)

You can also solve the problem with a more classic approach, using the native SQL when-otherwise functions:

import org.apache.spark.sql.functions._
import sparkSession.implicits._

// Sample data; note that item is modeled here as a Map rather than a struct
val defectiveItems = Seq(
  (1, "gem1", Map("gem1" -> 10, "gem2" -> 0, "gem3" -> 0)),
  (2, "gem1", Map("gem1" -> 15, "gem2" -> 0, "gem3" -> 0)),
  (3, "gem1", Map("gem1" -> 33, "gem2" -> 0, "gem3" -> 0)),
  (4, "gem3", Map("gem1" -> 0, "gem2" -> 0, "gem3" -> 2))
).toDF("Id", "defectiveItem", "item")

// Branch on the defectiveItem value and pick the matching entry from item
val datasetWithCount = defectiveItems.withColumn("count",
  when($"defectiveItem" === "gem1", $"item.gem1")
    .otherwise(when($"defectiveItem" === "gem2", $"item.gem2")
    .otherwise($"item.gem3")))

println("All items=" + datasetWithCount.collectAsList())

It will print:

All items=[[1,gem1,Map(gem1 -> 10, gem2 -> 0, gem3 -> 0),10], [2,gem1,Map(gem1 -> 15, gem2 -> 0, gem3 -> 0),15], [3,gem1,Map(gem1 -> 33, gem2 -> 0, gem3 -> 0),33], [4,gem3,Map(gem1 -> 0, gem2 -> 0, gem3 -> 2),2]]

By using a native solution you can take advantage of Spark's internal optimizations of the execution plan.
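
As a side note (my addition, not from either answer): because this sample data models item as a MapType column, the when-otherwise chain can be replaced by indexing the map with the defectiveItem column, which also stays native; datasetWithCount2 is an assumed name:

// Look up the map entry keyed by each row's defectiveItem value
val datasetWithCount2 = defectiveItems.withColumn("count", $"item"($"defectiveItem"))

On Spark 2.4+ the same lookup can be written as element_at($"item", $"defectiveItem"). Note this shortcut applies to maps only: extracting a struct field requires a literal field name, which is why the question's struct schema needs the udf or the when-otherwise chain.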