I am working with a DataFrame that has the following schema:
root
|-- Id: integer (nullable = true)
|-- defectiveItem: string (nullable = true)
|-- item: struct (nullable = true)
| |-- gem1: integer (nullable = true)
| |-- gem2: integer (nullable = true)
| |-- gem3: integer (nullable = true)
The defectiveItem column contains one of the values gem1, gem2, or gem3, while item holds the count for each item. Based on the defective item, I need to project the count of that item out of item into a new column named count.
For example, if the defectiveItem column contains gem1 and item contains {"gem1":3,"gem2":4,"gem3":5}, the resulting count column should contain 3.
The resulting schema should look like this:
root
|-- Id: integer (nullable = true)
|-- defectiveItem: string (nullable = true)
|-- item: struct (nullable = true)
| |-- gem1: integer (nullable = true)
| |-- gem2: integer (nullable = true)
| |-- gem3: integer (nullable = true)
|-- count: integer (nullable = true)
Answer 0 (score: 1)
You can use a udf function, as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Look up the struct field whose name matches the value in defectiveItem.
def getItemUdf = udf((defectItem: String, item: Row) => item.getAs[Int](defectItem))

df.withColumn("count", getItemUdf(col("defectiveItem"), col("item"))).show(false)
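As an alternative sketch (my own addition, not part of the original answer): because the struct's field names are fixed and known, you can stay with native column expressions by packing the fields into a map column and indexing it with the defectiveItem column. The names counts and df here are illustrative.

```scala
import org.apache.spark.sql.functions._
import sparkSession.implicits._

// Rebuild the struct's fields as a map column, then index the map with the
// defectiveItem column. Column.apply on a map column performs a key lookup,
// so no udf is required.
val counts = map(
  lit("gem1"), $"item.gem1",
  lit("gem2"), $"item.gem2",
  lit("gem3"), $"item.gem3")

df.withColumn("count", counts($"defectiveItem")).show(false)
```

This keeps the whole expression inside Catalyst, so it can be optimized like any native column expression, though it still hard-codes the field names.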
I hope this answer is useful.
Answer 1 (score: 0)
You can also solve the problem in a more classic way, using the SQL-native when/otherwise functions:
import sparkSession.implicits._

val defectiveItems = Seq(
  (1, "gem1", Map("gem1" -> 10, "gem2" -> 0, "gem3" -> 0)),
  (2, "gem1", Map("gem1" -> 15, "gem2" -> 0, "gem3" -> 0)),
  (3, "gem1", Map("gem1" -> 33, "gem2" -> 0, "gem3" -> 0)),
  (4, "gem3", Map("gem1" -> 0, "gem2" -> 0, "gem3" -> 2))
).toDF("Id", "defectiveItem", "item")

import org.apache.spark.sql.functions._

// Pick the matching count with a chain of when/otherwise expressions.
val datasetWithCount = defectiveItems.withColumn("count",
  when($"defectiveItem" === "gem1", $"item.gem1")
    .otherwise(when($"defectiveItem" === "gem2", $"item.gem2")
      .otherwise($"item.gem3")))

println("All items=" + datasetWithCount.collectAsList())
It will print:
All items=[[1,gem1,Map(gem1 -> 10, gem2 -> 0, gem3 -> 0),10], [2,gem1,Map(gem1 -> 15, gem2 -> 0, gem3 -> 0),15], [3,gem1,Map(gem1 -> 33, gem2 -> 0, gem3 -> 0),33], [4,gem3,Map(gem1 -> 0, gem2 -> 0, gem3 -> 2),2]]
By using a native solution, you can take advantage of Spark's internal optimizations of the execution plan.
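One further note (assuming Spark 2.4 or later, which the answer does not state): since item in this answer's sample data is a MapType column, the built-in element_at function can perform the lookup with a dynamic key, replacing the whole when/otherwise chain while staying native.

```scala
import org.apache.spark.sql.functions._
import sparkSession.implicits._

// element_at(map, key) accepts a Column as the key, so the lookup key can
// come from the defectiveItem column itself (available since Spark 2.4).
val datasetWithCount = defectiveItems.withColumn(
  "count", element_at($"item", $"defectiveItem"))
```

Unlike the when/otherwise chain, this does not need to be updated if new gem keys are added to the map.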