I have a variable declared as follows:
val jobnameSeq = Seq( ("42409245", "name12"),("42461545", "name4"),("42409291", "name1"),("42413872", "name3"),("42417044", "name2"))
I would like to create a function, usable in a Spark SQL query, that replaces 42461545 with name4.
I tried to declare the function like this:
val jobnameDF = jobnameSeq.toDF("jobid","jobname")
sqlContext.udf.register("getJobname", (id: String) =>
  jobnameDF.filter($"jobid" === id).select($"jobname")
)
and use it in SQL like this:
select getjobname(jobid), other, field from table
But jobnameDF.filter($"jobid" === id).select($"jobname") returns a DataFrame rather than a String, and I can't figure out how to simply convert this value to a String, given that there is only ever one result.
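(Note: the deeper problem is that a DataFrame cannot be used inside a UDF at all; the UDF body runs on the executors, where jobnameDF is not available. On the driver, though, the single value can be pulled out directly. A minimal sketch, assuming the jobnameDF above and exactly one matching row:
val name: String = jobnameDF
  .filter($"jobid" === "42461545")
  .select($"jobname")
  .head()        // first (and only) Row
  .getString(0)  // the only column of that Row
)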
If a Seq is not the right object to use in this case, I'm open to suggestions.
EDIT
Although the suggested answers work, here is what I actually did to get this working:
// Convert my Seq to a Map
val jobMap = jobnameSeq.toMap
// Declare a SQL function so I can use it in Spark SQL
// (it needs to be accessible to people who don't know Scala)
sqlContext.udf.register("getJobname", (id: String) =>
  jobMap(id)
)
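With the map-backed UDF registered, the query from the question should work as written (table and column names as in the original query):
sqlContext.sql("select getJobname(jobid), other, field from table").show()
Note that jobMap(id) throws a NoSuchElementException for an unknown jobid; jobMap.getOrElse(id, "unknown") is the safer variant.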
Answer 0 (score: 2)
You can achieve this in several ways:
val jobnameSeq = Seq( ("42409245", "name12"),("42461545", "name4"),
("42409291", "name1"),("42413872", "name3"),("42417044", "name2"))
val jobIdDF = Seq("42409245", "42409291", "42409231").toDF("jobID")
jobIdDF.createOrReplaceTempView("JobView")
Simply use plain Scala's toMap function on the job-name sequence:
sqlContext.udf.register("jobNamelookUp", (jobID: String) =>
jobnameSeq.toMap.getOrElse(jobID,"null"))
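A small refinement worth noting: as written, jobnameSeq.toMap is rebuilt on every UDF invocation. A sketch that builds the Map once on the driver and lets the UDF closure capture it:
// Build the lookup Map once instead of on every call
val jobnameLocalMap = jobnameSeq.toMap
sqlContext.udf.register("jobNamelookUp", (jobID: String) =>
  jobnameLocalMap.getOrElse(jobID, "null"))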
// OR
If the input is an RDD, use Spark's collectAsMap:
val jobnameMap = sc.parallelize(jobnameSeq).collectAsMap
sqlContext.udf.register("lookupJobName",(jobID:String) =>
jobnameMap.getOrElse(jobID,"null"))
// OR
If this lookup happens on a cluster, you can broadcast the map:
val jobnameMapBC = sc.broadcast(jobnameMap)
sqlContext.udf.register("lookupJobNameBC",(jobID:String) =>
jobnameMapBC.value.getOrElse(jobID,"null"))
spark.sql("select jobID,jobNamelookUp(jobID) as jobNameUsingMap,
lookupJobNameBC(jobID) as jobNameUsingBC,
lookupJobName(jobID) as jobNameUsingRDDMap
from JobView")
.show()
+--------+---------------+--------------+------------------+
| jobID|jobNameUsingMap|jobNameUsingBC|jobNameUsingRDDMap|
+--------+---------------+--------------+------------------+
|42409245| name12| name12| name12|
|42409291| name1| name1| name1|
|42409231| null| null| null|
+--------+---------------+--------------+------------------+
As suggested by Raphael, using a broadcast-join:
import org.apache.spark.sql.functions._
val jobnameSeqDF = jobnameSeq.toDF("jobID","name")
jobIdDF.join(broadcast(jobnameSeqDF), Seq("jobID"),"leftouter").show(false)
+--------+------+
|jobID |name |
+--------+------+
|42409245|name12|
|42409291|name1 |
|42409231|null |
+--------+------+
Answer 1 (score: 1)
From what I understand of your question, you should create a Map from the sequence and fetch the job name for the jobId directly:
val simpleMap = jobnameSeq.toMap
println(simpleMap("42461545"))
This should give you name4.
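Be aware that simpleMap(key) throws a NoSuchElementException when the id is missing; get or getOrElse make the lookup safe:
println(simpleMap.get("42461545"))              // Some(name4)
println(simpleMap.getOrElse("00000000", "n/a")) // hypothetical missing id -> n/a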
Now, if you want to test it using a dataframe, you can do the following:
val jobnameDF = jobnameSeq.toDF("jobid","jobname")
val jobName = jobnameDF.filter($"jobid" === "42461545").select("jobname").first().getAs[String]("jobname")
println(jobName)
This should print name4.