I have a UDF in Spark that returns a Map as its output.
Dataset<Row> dataSet = sql.sql("select *, address(col1,col2) as udfoutput from input");
I want to append the values returned in the map as columns.
Ex - if the input table has 2 columns and the UDF's map returns 2 key-value pairs, the resulting Dataset should have 4 columns in total.
Answer 0 (score: 2):
How about:
select
  *,
  address(col1, col2)['key1'] as key1,
  address(col1, col2)['key2'] as key2
from input
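For this to work, the UDF has to be registered with the session first. A minimal sketch, where the body of address is a hypothetical stand-in (two fixed keys built from the inputs); your real function would carry its own logic:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("udf-map-example").getOrCreate()

  // Hypothetical body: the real UDF would compute an actual address map.
  spark.udf.register("address", (col1: String, col2: String) =>
    Map("key1" -> col1, "key2" -> col2))

  val result = spark.sql(
    """select *,
      |       address(col1, col2)['key1'] as key1,
      |       address(col1, col2)['key2'] as key2
      |from input""".stripMargin)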
Or, to call the UDF only once, use a with clause:
with raw as (
  select *, address(col1, col2) as address from input
)
select
  *,
  address['key1'],
  address['key2']
from raw
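Run from Spark, the with variant is just another spark.sql call, reusing the registration sketched above:

  val withAddress = spark.sql(
    """with raw as (select *, address(col1, col2) as address from input)
      |select *, address['key1'], address['key2']
      |from raw""".stripMargin)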
That would be the Hive way.
In Spark, you can do the same with imperative transformations through the Dataset API (instead of declarative SQL). In Scala it might look like this; in Java, I believe, it is a bit more verbose:
// First your schemas as case classes (POJOs)
case class MyModelClass(col1: String, col2: String)
case class MyModelClassWithAddress(col1: String, col2: String, address: Map[String, String])

// In Spark any Scala function can serve as a UDF
def address(col1: String, col2: String): Map[String, String] = ???

// Now the imperative Spark code
import org.apache.spark.sql.{Dataset, Row}
import spark.implicits._

val dataSet: Dataset[Row] = ??? // you can read the table from the Hive Metastore, or use spark.read ...

dataSet
  .as[MyModelClass]
  .map(myModel => MyModelClassWithAddress(myModel.col1, myModel.col2, address(myModel.col1, myModel.col2)))
  .write.save(...) // wherever it needs to be written later
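As a side note, if you prefer to stay with untyped DataFrames, map entries can also be pulled into columns with Column.getItem. A minimal sketch, assuming the map column is named udfoutput as in the question:

  import org.apache.spark.sql.functions.col

  // Each getItem call projects one key of the map into its own column.
  val expanded = dataSet
    .withColumn("key1", col("udfoutput").getItem("key1"))
    .withColumn("key2", col("udfoutput").getItem("key2"))
    .drop("udfoutput") // optional: drop the map once the columns exist

getItem works on both MapType and ArrayType columns, so the same pattern covers array-returning UDFs as well.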