Spark - how to apply a udf to a single field inside a Seq[Map[String, String]]

Date: 2018-04-03 06:12:27

Tags: scala apache-spark apache-spark-sql user-defined-functions

I have a DataFrame with two columns, of types String and Seq[Map[String, String]]. Something like:

Name    Contact
Alan    [(Map(number -> 12345   , type -> home)),   (Map(number -> 87878787 , type -> mobile))]
Ben     [(Map(number -> 94837593    , type -> job)),(Map(number -> 346      , type -> home))]

So what I need is to apply a udf to the number field of each Map[String, String] element in the array. The udf basically replaces with 0000 any number whose length is less than 6. Like this:

def valid_num_udf =
  udf((numb: String) =>
    if (numb.length < 6)
      "0000"
    else
      numb
  )

The expected result would be:

NAME    CONTACT
Alan    [(Map(number -> 0000    , type -> home)),   (Map(number -> 87878787 , type -> mobile))]
Ben     [(Map(number -> 94837593    , type -> job)),(Map(number -> 0000     , type -> home))]
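The core transformation can be sketched in plain Scala, without Spark (the validNum helper name is mine, for illustration only):

```scala
// Rule from the question: any number shorter than 6 characters becomes "0000".
def validNum(numb: String): String =
  if (numb.length < 6) "0000" else numb

val contacts: Seq[Map[String, String]] = Seq(
  Map("number" -> "12345", "type" -> "home"),
  Map("number" -> "87878787", "type" -> "mobile")
)

// + on an immutable Map replaces the "number" entry and leaves
// every other key/value pair untouched.
val cleaned = contacts.map(m => m + ("number" -> validNum(m("number"))))
```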

What I want is to use another udf to access each number field and then apply valid_num_udf() to it.

I am trying something like this, but I don't know the correct syntax for it in Scala:

val newDf = Df.withColumn("VALID_CONTACT", myUdf($"CONTACT"))

//This part is really really wrong, but don't know better
def myUdf = udf[Seq[Map[String, String]], Seq[Map[String, String]]] { 
    inputSeq => inputSeq.map(_.get("number") => valid_num_udf(_.get("number")))
}

Can someone tell me how to access just that one field in the map, while keeping all the other fields unchanged?

Update: the schema of the DataFrame is

root
 |-- NAME: string (nullable = true)
 |-- CONTACT: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)

org.apache.spark.sql.types.StructType = StructType(StructField(NAME,StringType,true), StructField(CONTACT,ArrayType(MapType(StringType,StringType,true),true),true))

3 Answers:

Answer 0 (score: 2)

The signature of the UDF is slightly incorrect. You should pass the whole Seq[Map[String, String]] as input:

val validNumber = udf { (xs: Seq[Map[String, String]]) =>
  xs.map { x =>
    if (x("number").length < 6)
      Map("number" -> "0000", "type" -> x("type"))
    else x
  }
}

 df.show(false)
+----+-----------------------------------------------------------------------------+
|name|contact                                                                      |
+----+-----------------------------------------------------------------------------+
|Alan|[Map(number -> 6789, type -> home), Map(number -> 987654321, type -> mobile)]|
+----+-----------------------------------------------------------------------------+


df.select(validNumber($"contact") ).show(false)
+-----------------------------------------------------------------------------+
|UDF(contact)                                                                 |
+-----------------------------------------------------------------------------+
|[Map(number -> 0000, type -> home), Map(number -> 987654321, type -> mobile)]|
+-----------------------------------------------------------------------------+
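The lambda body of this answer can be exercised without a Spark session by applying it to a plain Seq directly (a sketch using the same sample data as the df.show above; the clean name is mine):

```scala
// Same branch logic as the UDF body above, as a plain function.
def clean(xs: Seq[Map[String, String]]): Seq[Map[String, String]] =
  xs.map { x =>
    if (x("number").length < 6)
      Map("number" -> "0000", "type" -> x("type"))
    else x
  }

val result = clean(Seq(
  Map("number" -> "6789", "type" -> "home"),
  Map("number" -> "987654321", "type" -> "mobile")
))
```

Note that this branch rebuilds the map from the number and type keys, so it assumes those are the only keys present.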

Answer 1 (score: 1)

Instead of creating two separate UDFs, you can use a single UDF that takes the whole Seq[Map[String, String]] as input and transforms it. This should be both faster and cleaner than doing it as two separate UDFs.

val valid_num_udf = udf((seq: Seq[Map[String, String]]) => {
  seq.map { m =>
    m.get("number") match {
      case Some(number) if number.length < 6 => m + ("number" -> "0000")
      case _ => m
    }
  }
})

df.withColumn("Contact", valid_num_udf($"Contact"))

Using the dataframe provided, this will give:

+----+----------------------------------------------------------------------------+
|Name|Contact                                                                     |
+----+----------------------------------------------------------------------------+
|Alan|[Map(number -> 0000, type -> home), Map(number -> 87878787, type -> mobile)]|
|Ben |[Map(number -> 94837593, type -> job), Map(number -> 0000, type -> home)]   |
+----+----------------------------------------------------------------------------+
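The m + ("number" -> ...) update used in this answer preserves every other key in the map, which can be checked in plain Scala (the extra country key is made up for illustration):

```scala
val m = Map("number" -> "123", "type" -> "home", "country" -> "PT")

// + replaces only the "number" entry on the immutable Map;
// all other entries are carried over unchanged.
val updated = m + ("number" -> "0000")
```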

To keep the number-validation logic separate from the UDF, you do not need to call a second UDF; just put the logic in a plain method and call that method from inside the UDF. For example,

def valid_num(number: String) = if (number.length < 6) "0000" else number

val myUdf = udf((seq: Seq[Map[String, String]]) => {
  seq.map { m =>
    m.get("number") match {
      case Some(number) => m + ("number" -> valid_num(number))
      case _ => m
    }
  }
})

Answer 2 (score: 1)

A udf function requires its arguments to be passed as columns, which are converted to primitive data types through serialization and deserialization. So by the time the column values reach the udf function, they are already primitive types. You therefore cannot call another udf function from inside a udf function, unless you convert the primitive types back into column types.

What you can do, instead of defining and calling another udf function, is simply define a plain function and call that function from inside the udf function:

import org.apache.spark.sql.functions._
def valid_num_udf(number: String) =
  if (number.length < 6) "0000" else number

def myUdf = udf((inputSeq: Seq[Map[String, String]]) =>
  inputSeq.map(x => Map("number" -> valid_num_udf(x("number")), "type" -> x("type")))
)

Then just call the udf function from the withColumn API:

val newDf = Df.withColumn("VALID_CONTACT", myUdf($"Contact"))
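The mapping inside myUdf can be verified without starting a Spark session by calling the same lambda body on plain Scala data (a sketch using sample numbers from the question):

```scala
// Plain helper, as suggested in this answer.
def valid_num_udf(number: String): String =
  if (number.length < 6) "0000" else number

val input = Seq(
  Map("number" -> "94837593", "type" -> "job"),
  Map("number" -> "346", "type" -> "home")
)

// Same expression as the body passed to udf(...) above.
val result = input.map(x => Map("number" -> valid_num_udf(x("number")), "type" -> x("type")))
```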