Row- and column-level Spark DataFrame operations using Scala

Time: 2018-10-19 06:55:22

Tags: scala apache-spark dataframe apache-spark-sql

Original DataFrame
      0.2 0.3

+------+---------+
| name | country |
+------+---------+
| Raju | UAS     |
| Ram  | Pak.    |
| null | China   |
| null | null    |
+------+---------+

I need this:
+------+---------+
| Nwet | wet Con |
+------+---------+
| 0.2  | 0.3     |
| 0.2  | 0.3     |
| 0.0  | 0.3     |
| 0.0  | 0.0     |
+------+---------+

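I want to create a UDF that works for both columns.
Applied to the name column, it should check whether the value is not null and return 0.2, otherwise 0.0. The same UDF applied to the country column should return 0.0 if the value is null and 0.3 if it is not null.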

1 answer:

Answer 0: (score: 0)

Using StringUtils from Apache Commons Lang:

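// Assumed to be commons-lang3; the answer only says "Apache's StringUtils".
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.DoubleType

// isBlank is true for null, empty, and whitespace-only strings.
val transcodificationName: UserDefinedFunction =
    udf { (name: String) =>
        if (StringUtils.isBlank(name)) 0.0 else 0.2
    }
// Same rule for country, with 0.3 as the non-blank value.
val transcodificationCountry: UserDefinedFunction =
    udf { (country: String) =>
        if (StringUtils.isBlank(country)) 0.0 else 0.3
    }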

// cast is a Column method, so it belongs inside withColumn, not after it.
dataframe
    .withColumn("Nwet", transcodificationName(col("name")).cast(DoubleType))
    .withColumn("wetCon", transcodificationCountry(col("country")).cast(DoubleType))
    .select("Nwet", "wetCon")
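For a plain null check, a UDF may not be needed at all. The sketch below is an alternative to the answer's approach, not the answer's own method: it uses Spark's built-in `when`/`otherwise` with `isNotNull`. The object name `NullWeights`, the helper `addWeights`, and the local `SparkSession` setup are assumptions made for illustration. Unlike `StringUtils.isBlank`, this version treats empty or whitespace-only strings as present values.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical wrapper object; the names here are not from the original answer.
object NullWeights {
  // Yields 0.2 / 0.3 where the source column is non-null, 0.0 otherwise.
  def addWeights(df: DataFrame): DataFrame =
    df.withColumn("Nwet", when(col("name").isNotNull, lit(0.2)).otherwise(lit(0.0)))
      .withColumn("wetCon", when(col("country").isNotNull, lit(0.3)).otherwise(lit(0.0)))
      .select("Nwet", "wetCon")

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption).
    val spark = SparkSession.builder().master("local[*]").appName("null-weights").getOrCreate()
    import spark.implicits._

    // Sample rows from the question; None becomes a SQL null.
    val dataframe = Seq[(Option[String], Option[String])](
      (Some("Raju"), Some("UAS")),
      (Some("Ram"), Some("Pak.")),
      (None, Some("China")),
      (None, None)
    ).toDF("name", "country")

    addWeights(dataframe).show()
    spark.stop()
  }
}
```

Built-in column expressions like these stay visible to Spark's optimizer, whereas a UDF is opaque to it, so this form is usually worth trying first when the rule is this simple.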

Edit:

val transcodificationColumns: UserDefinedFunction =
        udf { (input: String, columnName: String) =>
                // Blank input always maps to 0.0; otherwise the weight depends on the column.
                if (StringUtils.isBlank(input)) 0.0
                else if (columnName == "name") 0.2
                else if (columnName == "country") 0.3
                else 0.0
        }


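    // UDF arguments must be Columns, so the column name is wrapped in lit
    // (org.apache.spark.sql.functions.lit).
    dataframe
        .withColumn("Nwet", transcodificationColumns(col("name"), lit("name")).cast(DoubleType))
        .withColumn("wetCon", transcodificationColumns(col("country"), lit("country")).cast(DoubleType))
        .select("Nwet", "wetCon")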
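If more columns need the same treatment, one option is to keep the per-column weights in a small table and fold over it, rather than branching on the column name inside the UDF. This is a minimal sketch under stated assumptions: the object `ColumnWeights`, the `specs` sequence, and `applyWeights` are hypothetical names, and the check is a plain null test via `isNotNull` rather than `StringUtils.isBlank`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical helper; not part of the original answer.
object ColumnWeights {
  // (source column, output column, weight when the source is non-null)
  val specs: Seq[(String, String, Double)] = Seq(
    ("name", "Nwet", 0.2),
    ("country", "wetCon", 0.3)
  )

  // Adds one weight column per spec entry, then keeps only those columns.
  def applyWeights(df: DataFrame, specs: Seq[(String, String, Double)]): DataFrame = {
    val weighted = specs.foldLeft(df) { case (acc, (source, target, weight)) =>
      acc.withColumn(target, when(col(source).isNotNull, lit(weight)).otherwise(lit(0.0)))
    }
    weighted.select(specs.map { case (_, target, _) => col(target) }: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption).
    val spark = SparkSession.builder().master("local[*]").appName("column-weights").getOrCreate()
    import spark.implicits._

    // Two rows are enough to show both branches; None becomes a SQL null.
    val dataframe = Seq[(Option[String], Option[String])](
      (Some("Raju"), Some("UAS")),
      (None, None)
    ).toDF("name", "country")

    applyWeights(dataframe, specs).show()
    spark.stop()
  }
}
```

Adding a third weighted column then only requires a new entry in `specs`.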