I have a DataFrame with two columns, of types String and Seq[Map[String, String]]. Something like this:
Name Contact
Alan [(Map(number -> 12345 , type -> home)), (Map(number -> 87878787 , type -> mobile))]
Ben [(Map(number -> 94837593 , type -> job)),(Map(number -> 346 , type -> home))]
So what I need is to apply a udf to the number field inside each Map[String, String] of the array. This udf basically converts to 0000 any number whose length is less than 6. Like this:
def valid_num_udf =
  udf((numb: String) => {
    if (numb.length < 6)
      "0000"
    else
      numb
  })
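As a sanity check, the same padding logic can be exercised on plain strings outside Spark; the standalone name validNum below is only for illustration:

```scala
// Standalone version of the padding logic, without the Spark udf wrapper.
def validNum(numb: String): String =
  if (numb.length < 6) "0000"
  else numb

// Numbers shorter than 6 characters are replaced wholesale.
println(validNum("12345"))    // 0000
println(validNum("87878787")) // 87878787
```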
The expected result is the following:
NAME CONTACT
Alan [(Map(number -> 0000 , type -> home)), (Map(number -> 87878787 , type -> mobile))]
Ben [(Map(number -> 94837593 , type -> job)),(Map(number -> 0000 , type -> home))]
What I want is to use another udf to access each number field and then apply valid_num_udf() to it. I am trying something like this, but I don't know the correct syntax to do it in Scala.
val newDf = Df.withColumn("VALID_CONTACT", myUdf($"CONTACT"))
//This part is really really wrong, but don't know better
def myUdf = udf[Seq[Map[String, String]], Seq[Map[String, String]]] {
inputSeq => inputSeq.map(_.get("number") => valid_num_udf(_.get("number")))
}
Can anyone tell me how to access just that one field in the map, while leaving the rest of the map unchanged?
Update: the schema of the DataFrame is
root
|-- NAME: string (nullable = true)
|-- CONTACT: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
or
org.apache.spark.sql.types.StructType = StructType(StructField(NAME,StringType,true), StructField(CONTACT,ArrayType(MapType(StringType,StringType,true),true),true))
Answer 0: (score: 2)
The signature of the UDF is slightly incorrect. You are passing a Seq[Map[String, String]] as input:
val validNumber = udf { (xs: Seq[Map[String, String]]) =>
  xs.map { x =>
    if (x("number").length < 6)
      Map("number" -> "0000", "type" -> x("type"))
    else x
  }
}
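The lambda body can be tried on a plain Scala collection first, independent of Spark; the sample data below mirrors the answer's example and is made up for illustration:

```scala
// The same per-map transform, run on a plain Seq for quick verification.
val contacts: Seq[Map[String, String]] = Seq(
  Map("number" -> "6789", "type" -> "home"),
  Map("number" -> "987654321", "type" -> "mobile")
)

val cleaned = contacts.map { x =>
  if (x("number").length < 6)
    Map("number" -> "0000", "type" -> x("type"))
  else x
}
// Short numbers are masked, long ones pass through unchanged.
println(cleaned)
```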
df.show(false)
+----+-----------------------------------------------------------------------------+
|name|contact |
+----+-----------------------------------------------------------------------------+
|Alan|[Map(number -> 6789, type -> home), Map(number -> 987654321, type -> mobile)]|
+----+-----------------------------------------------------------------------------+
df.select(validNumber($"contact") ).show(false)
+-----------------------------------------------------------------------------+
|UDF(contact) |
+-----------------------------------------------------------------------------+
|[Map(number -> 0000, type -> home), Map(number -> 987654321, type -> mobile)]|
+-----------------------------------------------------------------------------+
Answer 1: (score: 1)
Instead of creating two separate UDFs, you can use a single UDF that takes the whole Seq[Map[String, String]] as input and transforms it. That should be faster and better than doing it as two separate UDFs.
val valid_num_udf = udf((seq: Seq[Map[String, String]]) => {
  seq.map { m =>
    m.get("number") match {
      case Some(number) if number.length < 6 => m + ("number" -> "0000")
      case _ => m
    }
  }
})
df.withColumn("Contact", valid_num_udf($"Contact"))
Using the provided dataframe, this will give:
+----+----------------------------------------------------------------------------+
|Name|Contact                                                                     |
+----+----------------------------------------------------------------------------+
|Alan|[Map(number -> 0000, type -> home), Map(number -> 87878787, type -> mobile)]|
|Ben |[Map(number -> 94837593, type -> job), Map(number -> 0000, type -> home)]   |
+----+----------------------------------------------------------------------------+
To keep the logic separate from the UDF, you don't need to call a separate UDF; simply put the logic in a method and call that method from inside the UDF. For example,
def valid_num(number: String) =
  if (number.length < 6)
    "0000"
  else
    number

val myUdf = udf((seq: Seq[Map[String, String]]) => {
  seq.map { m =>
    m.get("number") match {
      case Some(number) => m + ("number" -> valid_num(number))
      case _ => m
    }
  }
})
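Note that the m + ("number" -> ...) expression used here updates only that one key and leaves every other entry of the map intact, which is easy to confirm in plain Scala (the extra "label" key below is made up for illustration):

```scala
// Adding a key that already exists replaces just that entry.
val m = Map("number" -> "346", "type" -> "home", "label" -> "x")
val updated = m + ("number" -> "0000")

// Only "number" changed; "type" and "label" survive untouched.
println(updated)
```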
Answer 2: (score: 1)
A udf function requires columns to be passed as arguments, and those columns are converted to primitive data types through serialization and deserialization. So by the time the column values reach the udf function, they are already primitive data types. Hence you cannot call another udf function from inside a udf function, unless you convert the primitive type back into a column type.
Instead of defining and calling another udf function, what you can do is define a plain function and call that function from inside the udf function:
import org.apache.spark.sql.functions._
def valid_num_udf(number: String) = number.length < 6 match {
  case true  => "0000"
  case false => number
}

def myUdf = udf((inputSeq: Seq[Map[String, String]]) => {
  inputSeq.map(x => Map("number" -> valid_num_udf(x("number")), "type" -> x("type")))
})
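One caveat: this version rebuilds each map with only the number and type keys, so any other entry a map might carry would be silently dropped. A sketch of a variant that preserves all keys (the names and the "extra" key below are illustrative, not from the original answer):

```scala
// Same masking rule, but updating the existing map instead of rebuilding it.
def validNum(number: String): String =
  if (number.length < 6) "0000" else number

def fixNumbers(inputSeq: Seq[Map[String, String]]): Seq[Map[String, String]] =
  inputSeq.map(x => x + ("number" -> validNum(x("number"))))

// Keys other than "number" pass through unchanged.
val result = fixNumbers(Seq(Map("number" -> "346", "type" -> "home", "extra" -> "kept")))
println(result)
```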
Then just call the udf function from the withColumn API:
val newDf = Df.withColumn("VALID_CONTACT", myUdf($"Contact"))