Spark SQL: casting a UDF's return value

Date: 2018-06-08 12:37:58

Tags: scala apache-spark apache-spark-sql

I have a variable declared as follows:

val jobnameSeq = Seq( ("42409245", "name12"),("42461545", "name4"),("42409291", "name1"),("42413872", "name3"),("42417044", "name2"))

I want a function I can use in a Spark SQL query to replace 42461545 with name4. I tried to declare it like this:

val jobnameDF = jobnameSeq.toDF("jobid","jobname")
sqlContext.udf.register("getJobname", (id: String) => (
     jobnameDF.filter($"jobid" === id).select($"jobname")
    )
)

and use it in SQL like this:

select getjobname(jobid), other, field from table  

But jobnameDF.filter($"jobid" === id).select($"jobname") returns a DataFrame rather than a String, and I can't figure out how to simply cast this value to a String, given that there is only ever one result.

If a Seq is not the right structure to use in this situation, I'm open to suggestions.
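For context on why the attempt above fails: a UDF body runs on the executors, where a DataFrame reference cannot be used to launch another query. When a single driver-side lookup is all that is needed, the one matching value can be pulled out of the result directly. A minimal sketch, assuming the `jobnameDF` defined above and `spark.implicits._` in scope:

```scala
// Driver-side only: narrow to the one matching row, read it back as a String.
// headOption avoids an exception when no row matches.
val jobname: Option[String] =
  jobnameDF.filter($"jobid" === "42461545")
           .select($"jobname")
           .as[String]      // Dataset[String]
           .collect()
           .headOption
```

This does not solve the "usable from SQL" requirement, which is what the answers below address with a Map-backed UDF.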

Edit
Although the suggested answers work, here is what I actually did to get this working:

// Convert my Seq to a Map
val jobMap = jobnameSeq.toMap
// Declare a SQL function so I can use it from Spark SQL
// (it needs to be accessible to people who don't know Scala)
sqlContext.udf.register("getJobname", (id: String) => jobMap(id))
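One caveat with the snippet above: `jobMap(id)` throws a NoSuchElementException when a jobid is missing from the map, which fails the whole query. A hedged variant with a fallback default (the `"unknown"` string is an arbitrary choice):

```scala
// Defensive variant: fall back to a default instead of throwing
// when the jobid has no entry in the map.
sqlContext.udf.register("getJobname",
  (id: String) => jobMap.getOrElse(id, "unknown"))
```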

2 Answers:

Answer 0 (score: 2)

You can achieve this in several ways:

val jobnameSeq = Seq( ("42409245", "name12"),("42461545", "name4"),
                      ("42409291", "name1"),("42413872", "name3"),("42417044", "name2"))
val jobIdDF = Seq("42409245", "42409291", "42409231").toDF("jobID")
jobIdDF.createOrReplaceTempView("JobView")

Simply use plain Scala's toMap on the job-name sequence:

sqlContext.udf.register("jobNamelookUp", (jobID: String) =>  
                                            jobnameSeq.toMap.getOrElse(jobID,"null"))

// OR

If the input is an RDD, use Spark's collectAsMap:

val jobnameMap = sc.parallelize(jobnameSeq).collectAsMap
sqlContext.udf.register("lookupJobName",(jobID:String) => 
                                            jobnameMap.getOrElse(jobID,"null"))

// OR

If this lookup happens on a cluster, you can broadcast the map:

val jobnameMapBC = sc.broadcast(jobnameMap)
sqlContext.udf.register("lookupJobNameBC",(jobID:String) => 
                                                jobnameMapBC.value.getOrElse(jobID,"null")) 

spark.sql("""select jobID, jobNamelookUp(jobID) as jobNameUsingMap,
                    lookupJobNameBC(jobID) as jobNameUsingBC,
                    lookupJobName(jobID) as jobNameUsingRDDMap
             from JobView""")
     .show()

+--------+---------------+--------------+------------------+
|   jobID|jobNameUsingMap|jobNameUsingBC|jobNameUsingRDDMap|
+--------+---------------+--------------+------------------+
|42409245|         name12|        name12|            name12|
|42409291|          name1|         name1|             name1|
|42409231|           null|          null|              null|
+--------+---------------+--------------+------------------+    

Following Raphael's suggestion, using a broadcast join:

import org.apache.spark.sql.functions._
val jobnameSeqDF = jobnameSeq.toDF("jobID","name")
jobIdDF.join(broadcast(jobnameSeqDF), Seq("jobID"),"leftouter").show(false)

+--------+------+
|jobID   |name  |
+--------+------+
|42409245|name12|
|42409291|name1 |
|42409231|null  |
+--------+------+
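If the goal is to keep the lookup usable by SQL-only users (as the question's edit mentions), the joined result can itself be registered as a temp view. A sketch, assuming the DataFrames above; the view name `jobWithNames` is arbitrary:

```scala
// Register the pre-joined result so SQL users can query it directly.
jobIdDF.join(broadcast(jobnameSeqDF), Seq("jobID"), "leftouter")
       .createOrReplaceTempView("jobWithNames")

spark.sql("select jobID, name from jobWithNames").show(false)
```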

Answer 1 (score: 1)

From what I understand of your question, you should create a Map from the sequence and look up the jobId directly:

val simpleMap = jobnameSeq.toMap

println(simpleMap("42461545"))

This should give you name4.

Now, if you want to test this with a dataframe, you can do the following:

val jobnameDF = jobnameSeq.toDF("jobid","jobname")

val jobName = jobnameDF.filter($"jobid" === "42461545").select("jobname").first().getAs[String]("jobname")

println(jobName)

This should print name4.