Question

I have a question. I have a Spark DataFrame with several columns, like this:

id  Color
1   Red,Blue,Black
2   Red,Green
3   Blue,Yellow,Green
...

I also have a mapping file like this:
Red,0
Blue,1
Green,2
Black,3
Yellow,4

What I need to do is map the color names to their ids, e.g. map "Red,Blue,Black" to the array [1,1,0,1,0]. I wrote the code like this:

def mapColor(label_string: String): Array[Int] = {
  val labels = label_string.split(",")
  val index_array = new Array[Int](COLOR_LENGTH)
  for (label <- labels) {
    if (COLOR_MAP.contains(label)) {
      index_array(COLOR_MAP(label)) = 1
    } else {
      // dictionary does not contain the label, the last index set to be one
      index_array(COLOR_LENGTH - 1) = 1
    }
  }
  index_array
}

COLOR_LENGTH is the length of the dictionary, and COLOR_MAP is the dictionary holding the string -> id mapping.

I call this function like this:

 val color_function = udf(mapColor:(String)=>Array[Int])
 sql.withColumn("color_idx",color_function(col("Color")))

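I have several columns that need this operation, but each column needs a different dictionary. At the moment I duplicate this function for every column, changing only the dictionary and the length, which makes the code tedious. Is there a way to pass the length and the dictionary to the mapping function, for example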

def map(label_string:String,map:Map[String,Integer],len:Int):Array[Int]

But how should I call this function on a Spark DataFrame? I cannot pass the extra parameters in the declaration

val color_function = udf(mapColor:(String)=>Array[Int])

Answer 1

You can use a UDF that takes the color Map as an argument, as in the following example:

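// Assumes a SparkSession named `spark` is in scope, as in spark-shell
import org.apache.spark.sql.functions.udf
import spark.implicits._

val df = Seq(
  (1, "Red, Blue, Black"),
  (2, "Red, Green"),
  (3, "Blue, Yellow, Green")
).toDF("id", "color")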

val colorMap = Map("Red"-> 0, "Blue"->1, "Green"->2, "Black"->3, "Yellow"->4)

def mapColorCode(m: Map[String, Int]) = udf( (s: String) =>
  s.split("""\s*,\s*""").map(c => m.getOrElse(c, -99))
)

df.select($"id", mapColorCode(colorMap)($"color").as("colorcode")).show
// +---+----------+
// | id| colorcode|
// +---+----------+
// |  1| [0, 1, 3]|
// |  2|    [0, 2]|
// |  3| [1, 4, 2]|
// +---+----------+
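The example above returns the looked-up ids, whereas the question asks for a fixed-length 0/1 array. A minimal sketch of that variant follows, in plain Scala so the logic can be checked without a cluster. The name `oneHot` and its curried parameter list are hypothetical, and the sketch assumes every index in the dictionary is smaller than `len` and that unknown labels are simply skipped (the question's own code marks the last slot instead).

```scala
// Hypothetical helper, not part of Spark: one-hot encode a comma-separated label string.
// `dict` maps label -> index; `len` is the output length (assumed larger than every index).
def oneHot(dict: Map[String, Int], len: Int)(labels: String): Array[Int] = {
  val out = new Array[Int](len) // Int arrays start out filled with zeros
  labels.split("""\s*,\s*""").foreach { label =>
    // Labels missing from the dictionary are ignored in this sketch
    dict.get(label.trim).foreach(i => out(i) = 1)
  }
  out
}

val colorMap = Map("Red" -> 0, "Blue" -> 1, "Green" -> 2, "Black" -> 3, "Yellow" -> 4)
val encoded = oneHot(colorMap, colorMap.size)("Red, Blue, Black")
println(encoded.mkString("[", ",", "]")) // prints [1,1,0,1,0]
```

In Spark this could be wrapped the same way as `mapColorCode`, for example `def oneHotUdf(dict: Map[String, Int], len: Int) = udf((s: String) => oneHot(dict, len)(s))`, and applied as `oneHotUdf(colorMap, 5)($"color")`. Each column can then get its own dictionary and length without duplicating the function.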

Answer 2

Here is the complete code in a concise form -

val colrMapList = List("Red" -> 0, "Blue" -> 1, "Green" -> 2).toMap

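// Look up each color's integer code; unknown colors fall back to 0
def getColor = udf((colors: Seq[String]) =>
  if (colors.nonEmpty) colors.map(color => colrMapList.getOrElse(color, 0)).mkString(",")
  else "0"
)

val colors = List((1, Array("Red", "Blue", "Black")), (2, Array("Red", "Green")))
// Name the columns so that $"colors" below resolves
val colrDF = sc.parallelize(colors).toDF("id", "colors")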

colrDF.withColumn("colorMap", getColor($"colors")).show

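Explanation

Create a Map for the color-to-integer mapping.
The getColor function looks up the corresponding integer for each given color.
Finally, apply the function to colrDF to get the output.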

How to pass a Map to a Spark UDF?
