Mapping column values to a numeric type in Spark

Time: 2017-04-24 12:15:30

Tags: scala apache-spark

I have a DataFrame (df) in Spark with the following structure:

amount gender status
1000   male   married
1313   female single
1000   male   married

Basically, I want to create a new column that encodes gender as a number:

amount gender status  gender_num
1000   male   married 1
1313   female single  2
1000   male   married 1

I tried the following:

val gender = df.gender

val gender_num = gender match {
  case male => 1
  case female => 2
}

I get the following error:

<console>:125: error: value pa_gender_category is not a member of org.apache.spark.sql.DataFrame
val gender = data.pa_gender_category

I know there is a stringtoindex function, but I want to do this manually.

2 Answers:

Answer 0: (score: 9)

Use withColumn:

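import org.apache.spark.sql.functions.when
// Also requires `import spark.implicits._` for the $"..." column syntax
val input = df // the input DataFrame
val withGender = input.withColumn("gender_num", when($"gender" === "female", 2).otherwise(1))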

You can chain more conditions:

val withGender = input.withColumn("gender_num", when($"gender" === "female", 2).when($"gender" === "other", 3).otherwise(1))
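For reference, here is a minimal end-to-end sketch of the chained `when` approach that can be pasted into a Scala REPL with Spark on the classpath. It uses the sample rows from the question. The local session settings, the app name, and the fallback code `0` for unlisted values are assumptions of this sketch, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

// Local session for illustration only; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("gender-num-sketch") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Sample rows taken from the question.
val input = Seq(
  (1000, "male", "married"),
  (1313, "female", "single"),
  (1000, "male", "married")
).toDF("amount", "gender", "status")

// Conditions are checked in order; rows matching none of them fall through to `otherwise`.
val withGender = input.withColumn(
  "gender_num",
  when($"gender" === "male", 1)
    .when($"gender" === "female", 2)
    .otherwise(0) // assumption: 0 marks any value not listed above
)

withGender.show()
```

Naming every known value explicitly and reserving `otherwise` for the unknown case avoids silently coding unexpected values as "male".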

You can also use a UDF, as in Akash's answer. Note that UDFs sometimes cannot be optimized as well as built-in functions, but they can be more readable.

Answer 1: (score: 2)

You can use a Spark UDF:

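import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Maps a gender string to its numeric code; any other value raises a MatchError at runtime
def genderToNumber: UserDefinedFunction =
  udf((gender: String) => gender match {
    case "male"   => 1
    case "female" => 2
  })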
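The quotes around "male" and "female" in the `match` are significant. In Scala, a bare lowercase name in a `case` is a variable pattern that binds any value, which is one reason the unquoted attempt in the question cannot work even on a plain string. A minimal plain-Scala sketch (no Spark required; both function names and the fallback code `0` are hypothetical):

```scala
// A bare lowercase name is a variable pattern: it binds any input, so this always returns 1.
def withVariablePattern(gender: String): Int = gender match {
  case male => 1
}

// Quoted literals compare by value; the wildcard case keeps the match total.
def withLiteralPattern(gender: String): Int = gender match {
  case "male"   => 1
  case "female" => 2
  case _        => 0 // assumption: 0 marks unlisted values
}

println(withVariablePattern("female")) // prints 1
println(withLiteralPattern("female"))  // prints 2
```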

You can apply the UDF like this:
val newDF = df.withColumn("gender_num", genderToNumber(df("gender")))
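If the column may contain values other than "male" and "female", one possible variation is to return an `Option[Int]` from the UDF, which Spark stores as null when the result is `None`. The sketch below is not from the original answer: the names `genderCode` and `genderToNumberSafe`, the local session settings, and the extra "other" sample row are all assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Local session for illustration only; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("gender-udf-sketch") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Plain Scala function, so the mapping can be checked without Spark.
def genderCode(gender: String): Option[Int] = gender match {
  case "male"   => Some(1)
  case "female" => Some(2)
  case _        => None // covers null and unlisted values
}

// Wrap the function as a UDF; a None result becomes null in the output column.
val genderToNumberSafe = udf(genderCode _)

// Sample data based on the question, with a hypothetical "other" row added.
val df = Seq(
  (1000, "male", "married"),
  (1313, "female", "single"),
  (1000, "other", "married")
).toDF("amount", "gender", "status")

val newDF = df.withColumn("gender_num", genderToNumberSafe(df("gender")))
newDF.show()
```

Keeping the mapping in an ordinary function separates the logic from Spark, at the cost of the optimization caveat for UDFs mentioned in the other answer.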